
Is MapReduce A Major Step Backwards?

While learning Hadoop, I was wondering whether the MapReduce processing model can handle all the Big Data challenges. David DeWitt and Michael Stonebraker took it a step further by arguing in their blog article that MapReduce is a major step backwards. I found it a very good read, though I don't necessarily agree with the authors. It's always good to know different opinions and the contexts they come from. I also found that the authors wrote the best introduction to MapReduce in just a few short paragraphs. I quote them at the end, so read on.

To support their point, David and Michael laid out 5 arguments that MapReduce is:


  1. A giant step backward in the programming paradigm for large-scale data intensive applications
  2. A sub-optimal implementation, in that it used brute force instead of indexing
  3. Not novel at all – it represents a specific implementation of well-known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on

I think most of the arguments are valid, except that they overlook the context in which MapReduce was created and used. MapReduce was created to handle problems that were too big for a DBMS to handle; I guess at that time the term "big data" may not even have been invented. The processing model was not intended to compete with the DBMS but rather to complement it, so a feature-by-feature comparison is not really applicable. Indexing, for example, is great but takes a long time to build. If you only need to query the data once, creating an index first doesn't really save you any time.
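As a rough, back-of-envelope illustration of that trade-off (the data and the dictionary index below are hypothetical, not from the article), answering a single one-off query with a full scan costs one pass over the data, while building an index first also costs a full pass before the query can even run:

    # Hypothetical illustration: one-off query vs. building an index first.
    # 'records' is an in-memory list of (key, value) pairs.
    records = [("user_%d" % i, i * 10) for i in range(1_000_000)]
    target = "user_123456"

    # Option 1: answer the query with a single full scan -- one pass over the data.
    scan_hits = [value for key, value in records if key == target]

    # Option 2: build an index first, then probe it.
    # The build is itself a full pass over the data, so for a single query
    # it costs at least as much as the scan it was meant to avoid.
    index = {}
    for key, value in records:
        index.setdefault(key, []).append(value)
    index_hits = index.get(target, [])

    assert scan_hits == index_hits

The index only pays off when it is reused across many queries, which is not the access pattern of the one-shot analysis jobs MapReduce was built for.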

The authors actually pushed the community to think harder and improve. The Hadoop community thereafter created new tools such as Hive, which lets you query data with a SQL-like language, and HBase, which stores table data. This is an excellent case illustrating how different opinions can help improve a new technology.

Now, the best introduction to the MapReduce processing model, from the article:

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.

The map program reads a set of “records” from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a “split” function partitions the records into M disjoint buckets by applying a function to the key of each output record.   This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j,  1 ≤ i ≤ N,  1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M.  The input for each reduce instance Rj consists of the files Fi,j,  1 ≤ i ≤ N.  Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance — no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the “answer” to a MapReduce computation.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
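To make the quoted description concrete, here is a minimal, single-machine sketch of the same model in Python. The record layout, the number of buckets M, and the average-per-key reduce are my own illustrative choices, not code from the article; a real Hadoop job would run many map and reduce instances on different nodes.

    from collections import defaultdict

    M = 4  # number of buckets / reduce instances

    def map_fn(record):
        """User-written map: filter/transform a record into (key, data) pairs."""
        group_key, value = record
        yield group_key, value

    def reduce_fn(key, values):
        """User-written reduce: aggregate all values for one key (here, the average)."""
        return key, sum(values) / len(values)

    def run_mapreduce(records):
        # Map phase: each output record is sent to one of M disjoint buckets
        # by hashing its key, as in the quoted "split" function.
        buckets = [defaultdict(list) for _ in range(M)]
        for record in records:
            for key, data in map_fn(record):
                buckets[hash(key) % M][key].append(data)

        # Reduce phase: each bucket j plays the role of reduce instance Rj;
        # records are already grouped by key, so reduce is called once per key.
        answer = []
        for bucket in buckets:
            for key, values in bucket.items():
                answer.append(reduce_fn(key, values))
        return answer

    # Example: average value per group, the SQL GROUP BY analogy from the article.
    data = [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("c", 5)]
    print(run_mapreduce(data))  # e.g. [('a', 2.0), ('c', 5.0), ('b', 15.0)]

The hash-partitioned buckets stand in for the reduce instances Rj, and the per-key grouping inside each bucket mirrors the sort/hash grouping step described above.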

Categories: Big Data
  1. July 9th, 2012 at 15:36 | #1

    The two biggest complaints I’ve heard about MapReduce and Hadoop in general are

    – Lack of talent – smaller companies cannot compete with the likes of Yahoo and others that are using MapReduce, and the lack of talent leads, of course, to sub-optimal configurations and bad design/poor performance. SQL, by contrast, has a much lower barrier to entry and is easier to find talent for. Of course regular SQL servers don’t scale quite as wide as something like Hadoop, but there are offerings with extreme performance.
    – Not real time. Batch jobs like MapReduce by nature aren’t real time. It seems as time goes on people want more and more real-time stuff. This is the main reason Google (arguably the modern MapReduce champion) abandoned MapReduce several years ago: they couldn’t get data out of the thing fast enough, even with their vast farms of PhDs optimizing the code and an unlimited number of servers.

    But if you’re not in a hurry to get the results (when results can be measured in hours or minutes rather than seconds or sub-seconds, depending on the data set), then Hadoop can be a pretty good solution for many things.

    It is purpose-built and folks must keep that in mind. One of my former VPs wanted to use HDFS as a more general-purpose file system and store things like backups there. Yeah, that’ll work real well. They also wanted to try to stream data in real time into HDFS; a few months later (long after I left the company) they told me they tried it and “it sort of crashed”. HA HA.

    Use it for what it’s good for..

  2. July 11th, 2012 at 13:24 | #2

    Hi Steve – This is a great read (trying to absorb the paper on Dremel now)

    http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/

  3. July 11th, 2012 at 14:54 | #3

    Thanks for sharing this, just browsed through it. It’s pretty helpful, especially the latest updates from Google.
    Steve
