One of the most famous framework to support map-reduce for large scale data processing, a.k.a. BigData, is Hadoop as I introduced almost two years ago here. Data processing wise, the Java 8 stream API can do pretty much the same. Here is a quick sample that shows how it count number of words in string. There are significant differences in how they are implemented and the cases in which they should be used. Let’s discuss them after the sample.
As its name suggests, the Hadoop MapReduce include Map and Reduce in its processing data flow. At its highest level, the MapReduce follows the traditional wisdom “Divide and Conquer” – dividing big data to small data that can be processed by a commodity computer and then pulling the results together.
While learning Hadoop, I was wondering whether the MapReduce processing model that can handle all the Big Data challenges. David DeWitt and Michael Stonebrakeer took a step further by arguing MapReduce is a major step backwards in their blog article. I found it’s a very good reading but not necessarily agree with the authors. It’s always good to know different opinions and the contexts where they come from. I also found the authors wrote the best introduction of MapReduce in several short paragraphs. I quote them in the end, so read on.
As most of us know, Hadoop is a Java implementation of the MapReduce processing model originated from Google by Jeffrey Dean and Sanjay Ghemawat. After studying Hadoop and attending several related events(Hadoop Summit, Hadoop for Enterprise by Churchill Club), I felt I should dig deeper by reading the original paper.
The paper is titled “MapReduce: Simplified Data Processing on Large Clusters.” Unlike most research papers I’ve read before, it’s written in plain English and fairly easy to read and follow. I find it really worthwhile reading and strongly recommend you spend an hour to read through it.
While talking about the data processing, we naturally take CPU for granted. However, latest GPU (Graphics Processing Unit, also know as Visual Processing Unit, or VPU) comes with hundreds of cores and calculates much faster than CPU. The question is how practical it is to use GPUs in processing big data.