Is MapReduce A Major Step Backwards?

While learning Hadoop, I wondered whether the MapReduce processing model can handle all the Big Data challenges. David DeWitt and Michael Stonebraker went a step further, arguing in their blog article that MapReduce is a major step backwards. I found it a very good read, though I don't necessarily agree with the authors. It's always good to know different opinions and the contexts they come from. The authors also wrote one of the best introductions to MapReduce I've seen, in just a few short paragraphs. I quote it at the end, so read on.

To support their point, David and Michael laid out 5 arguments that MapReduce is:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it used brute force instead of indexing
  3. Not novel at all – it represents a specific implementation of well-known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on

I think most of the arguments are valid, except that they overlook the context in which MapReduce was created and used. MapReduce was created to handle problems that were too big for a DBMS to handle; I guess at that time the term "big data" may not even have been invented. The processing model was not intended to compete with DBMSs but to complement them, so a feature-by-feature comparison is not really applicable. Indexing, for example, is great, but it takes a long time to build. If you only need to query the data once, building an index first doesn't really save you any time.
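To make that indexing point concrete, here is a minimal sketch (my own illustration, not from the original article) comparing a one-off brute-force scan with building a dictionary index and then doing a single lookup. The data and timings are synthetic and just for intuition:

```python
import random
import time

# Generate some synthetic (key, value) records.
records = [(random.randint(0, 10_000_000), i) for i in range(2_000_000)]
target = records[len(records) // 2][0]

# One-off query via brute-force scan: O(n), no setup cost.
start = time.perf_counter()
scan_hits = [value for key, value in records if key == target]
scan_time = time.perf_counter() - start

# Same query via an index: O(n) to build, O(1) to look up.
start = time.perf_counter()
index = {}
for key, value in records:
    index.setdefault(key, []).append(value)
index_hits = index.get(target, [])
index_time = time.perf_counter() - start

print(f"scan only:            {scan_time:.3f}s")
print(f"build index + lookup: {index_time:.3f}s")
# For a single query the plain scan usually wins; the index only
# pays off once it is reused across many queries.
```

The same trade-off applies at cluster scale: an index amortizes its build cost only over repeated queries, which is exactly the workload MapReduce was not designed for.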

The authors actually pushed the community to think harder and improve. The Hadoop community thereafter created new tools like Hive, which lets you query data with a SQL-like language, and HBase, which stores table data. This is an excellent case of how different opinions can help improve a new technology.

Now, the best introduction to the MapReduce processing model, from the article:

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.

The map program reads a set of “records” from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a “split” function partitions the records into M disjoint buckets by applying a function to the key of each output record.   This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j,  1 ≤ i ≤ N,  1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M.  The input for each reduce instance Rj consists of the files Fi,j,  1 ≤ i ≤ N.  Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance — no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the “answer” to a MapReduce computation.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
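To make the quoted description concrete, here is a minimal, single-process Python sketch of the same model (my own illustration, not from the article or from Hadoop itself): map emits (key, data) records, a hash-based split function partitions them into M buckets, and one reduce instance per bucket groups records by key and aggregates them, much like SQL's group-by with an aggregate function.

```python
from collections import defaultdict

M = 4  # number of buckets, i.e. reduce instances

def map_fn(record):
    """Map: read a record (here a line of text), filter/transform,
    and emit (key, data) pairs -- (word, 1) for a word count."""
    for word in record.split():
        yield (word.lower(), 1)

def split_fn(key):
    """Split: a deterministic hash of the key chooses one of M buckets.
    Python's built-in hash is stable within a single process, which is
    all this toy example needs."""
    return hash(key) % M

def reduce_fn(key, values):
    """Reduce: arbitrary computation over all values sharing a key --
    here, summing the counts (analogous to a SQL aggregate)."""
    return (key, sum(values))

def mapreduce(input_records):
    # Map phase: every map output record is routed to a bucket.
    buckets = [defaultdict(list) for _ in range(M)]
    for record in input_records:
        for key, data in map_fn(record):
            buckets[split_fn(key)][key].append(data)

    # Reduce phase: one reduce instance per bucket; records are already
    # grouped by key, so each key is seen by exactly one reducer.
    results = []
    for bucket in buckets:
        for key, values in bucket.items():
            results.append(reduce_fn(key, values))
    return results

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the quick dog"]
    for word, count in sorted(mapreduce(lines)):
        print(word, count)
```

Because every map instance uses the same split function, records with the same key always land in the same bucket and therefore reach the same reducer, which is exactly the property the quoted passage emphasizes.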


3 Comments

  1. Posted July 9, 2012 at 3:36 pm | Permalink

    The two biggest complaints I’ve heard about MapReduce and Hadoop in general are

    – Lack of talent: smaller companies cannot compete with the likes of Yahoo and others that are using MapReduce, and the lack of talent leads, of course, to sub-optimal configurations, bad design, and poor performance. SQL, by contrast, has a much lower barrier to entry and it's easier to find talent for. Of course regular SQL servers don't scale quite as wide as something like Hadoop, but there are offerings with extreme performance.
    – Not real time: batch jobs like MapReduce by nature aren't real time. It seems as time goes on people want more and more real-time stuff. This is the main reason Google (arguably the modern MapReduce champion) abandoned MapReduce several years ago: they couldn't get data out of it fast enough, even with their vast farms of PhDs optimizing their code and their unlimited number of servers.

    But if you're not in a hurry to get the results (when results can be measured in hours rather than minutes, seconds, or sub-seconds, depending on the data set), then Hadoop can be a pretty good solution for many things.

    It is purpose-built and folks must keep that in mind. One of my former VPs wanted to use HDFS as a more general-purpose file system and store things like backups there. Yeah, that'll work real well. They also wanted to try to stream data in real time into HDFS; a few months later (long after I left the company) they told me they tried it and "it sort of crashed". HA HA.

    Use it for what it's good for.

  2. Posted July 11, 2012 at 1:24 pm | Permalink

    Hi Steve – This is a great read (trying to absorb the paper on Dremel now):

    http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/

  3. Posted July 11, 2012 at 2:54 pm | Permalink

    Thanks for sharing this, just browsed through it. It’s pretty helpful, especially the latest updates from Google.
    Steve

