MapReduce: The Theory Behind Hadoop
As most of us know, Hadoop is a Java implementation of the MapReduce processing model originated from Google by Jeffrey Dean and Sanjay Ghemawat. After studying Hadoop and attending several related events(Hadoop Summit, Hadoop for Enterprise by Churchill Club), I felt I should dig deeper by reading the original paper.
The paper is titled “MapReduce: Simplified Data Processing on Large Clusters.” Unlike most research papers I’ve read before, it’s written in plain English and fairly easy to read and follow. I find it really worthwhile reading and strongly recommend you spend an hour to read through it.
According to the authors, the idea was “inspired by the map and reduce primitives present in Lisp and many other functional languages.” They found “most of our computations involved applying a map operation to each logical ‘record’ in our input in order to compute a set of immediate key/value pairs, and then applying reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.”
Despite the conceptually straightforward computations, “the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amount of complex code to deal with these issues.”
As you can see, this is a perfect case where a framework can help – abstract all the common code that handles the distributed computing to the framework, and leave the simple processing to the framework users who are also developers and write simple map and reduce functions. Combined together is a powerful yet simple solution to the problems the authors faced at Google, for example, large-scale indexing.
The authors “wrote the first version of MapReduce library in February of 2003 and made significant enhancements to it in August of 2003.” The MapReduce library was implemented in C++, which runs more efficient than Java that is used by Hadoop. It makes sense in large scale clusters where any percentage improvement in performance can save hundreds of physical machines. In the Google environment of 2003, the machines used were dual-processor x86 processors running Linux with 2 to 4 GB of memory and directly attached IDE disks, with 100Mbps or 1Gbps network. After almost 10 years passed, the typical machines are almost 100 times more powerful per Moore’s law.
Interestingly, after the first paper was published in 2004, there hasn’t been much news from Google on its MapReduce project. The MapReduce idea was picked up by Yahoo and re-implemented in Java with a cool name called Hadoop which was open sourced and became very popular than the originator.
According to a gentleman I met at Hadoop Summit this year, Google’s MapReduce framework is still several years ahead of Hadoop today. I fully believe that because Google has enough use cases and challenges, and of course talented engineers, to push the framework forward.
What wondered me, though, is why Google didn’t open-sourced its MapReduce framework. Maybe it’s the core competence of Google business? It’s another topic that deserves its own article.