While learning Hadoop, I was wondering whether the MapReduce processing model that can handle all the Big Data challenges. David DeWitt and Michael Stonebrakeer took a step further by arguing MapReduce is a major step backwards in their blog article. I found it’s a very good reading but not necessarily agree with the authors. It’s always good to know different opinions and the contexts where they come from. I also found the authors wrote the best introduction of MapReduce in several short paragraphs. I quote them in the end, so read on.
To support their point, David and Michael laid out 5 arguments that MapReduce is:
Lost VMs or Containers? Too Many Consoles? Too Slow GUI? Time to learn how to "Google" and manage your VMware and clouds in a fast and secure HTML5 App.
- A giant step backward in the programming paradigm for large-scale data intensive applications
- A sub-optimal implementation, in that it used brute force instead of indexing
- Not novel at all – it represents a specific implementation of well know techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMS
- Incompatible with all of the tools DBMS users have come to depend on
I think most of the arguments are valid except the contexts which MapReduce was created in and used for. The MapReduce was created to handle the problems that were too big for DBMS to handle. I guess at that time the term big data may not have been invented. The processing model was not intended to compete, but rather complement, with DBMS, therefore feature comparison is not really applicable. Indexing, for example, is great but takes long time to create. If you just need to use index once, it doesn’t really save you any time by creating index first.
The authors actually pushed the community to think more and improve. The Hadoop community thereafter created new tools like Hive so that you can use SQL like tool to query data; HBase to save table data. This is an excellent case illustrating different opinions can help improve a new technology.
Now the best introduction of MapReduce processing model from the article:
The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.
The map program reads a set of “records” from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a “split” function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.
The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.
The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance — no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and feed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the “answer” to a MapReduce computation.
To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.