As its name suggests, Hadoop MapReduce includes Map and Reduce stages in its data processing flow. At the highest level, MapReduce follows the traditional wisdom of “divide and conquer” – dividing big data into small chunks that a commodity computer can process, and then pulling the results together.
If we look closely at the detailed workflow of how big data is processed in Hadoop, we’ll find many more stages. This article explains the steps involved in the Hadoop MapReduce data flow. For most use cases, developers only need to write customized code for the map and reduce steps and let the Hadoop framework take care of the rest.
- InputFormat. This first step reads in big data via HDFS APIs and chops it into many key/value pairs. For different types of source data, Hadoop provides built-in implementations such as TextInputFormat (the default), KeyValueInputFormat, SequenceFileInputFormat, and DBInputFormat. As its name suggests, TextInputFormat targets text files: it chops a file on a per-line basis, emitting one pair per line with the key as the byte offset of the line in the file and the value as the text of the line.
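The per-line chopping can be sketched in plain Java. This is a toy model of what TextInputFormat emits, not the Hadoop API; the class and method names are made up for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of TextInputFormat's behavior: emit one (byte offset, line text)
// pair per line of the input. Illustrative only; not Hadoop's real classes.
public class TextInputSketch {
    public static List<Map.Entry<Long, String>> split(String content) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n", -1)) {
            pairs.add(new SimpleEntry<>(offset, line));
            offset += line.getBytes().length + 1; // +1 for the newline byte
        }
        return pairs;
    }
}
```

So an input of `"a\nbb"` yields the pairs (0, "a") and (2, "bb") – the keys are byte positions, not line numbers.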
- Map. This step looks at each individual key/value pair from the first step and processes it with user-supplied logic. In this step, new sets of key/value pairs are generated. In the famous word count sample, the mapper gets in something like <10, “test is a test”> and maps it to <”test”, 1>, <”is”, 1>, <”a”, 1>, <”test”, 1>.
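The word count mapper logic can be sketched in plain Java. A real Hadoop mapper would extend the framework’s Mapper class; the names here are illustrative:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Word count mapper logic as a plain function: one ("word", 1) pair per
// token in the line. The offset key is ignored, as in the word count sample.
public class WordCountMapSketch {
    public static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }
}
```

Feeding in <10, "test is a test"> produces the four pairs from the example above.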
- Combine. This optional step is known as a map-side “mini-reducer.” It can reduce the amount of data to be shuffled across the wire by combining some of the pairs. Applied to the pairs from the Map sample above, the two <”test”, 1> records can be combined into one <”test”, 2>. As it’s essentially a reduce that happens to run on the map side, it has to implement the Reducer interface and must not change the keys. It may be run multiple times, or not at all.
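The combining logic can be sketched in plain Java as a local fold over one mapper’s output (again, real Hadoop code would implement the Reducer interface; these names are made up):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Combiner logic sketch: fold repeated keys from a single mapper's output
// into one (key, partial sum) pair before the shuffle.
public class CombinerSketch {
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Demo input mimicking the mapper output of "test is a test".
    public static Map<String, Integer> demo() {
        return combine(List.of(
            new SimpleEntry<>("test", 1), new SimpleEntry<>("is", 1),
            new SimpleEntry<>("a", 1), new SimpleEntry<>("test", 1)));
    }
}
```

On the sample pairs this sends three records across the wire instead of four; the savings grow with the amount of key repetition in each split.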
- Partition. This step allocates pairs across the number of partitions (reducers) developers specify. It uses the hash value of the key to determine which reducer a pair goes to. If you use Java built-in data types as keys, the hashed values should be evenly distributed. But if your key values are unevenly distributed, you may end up with an unbalanced workload across the reducers. Then you have to write a custom partitioner to balance the workload; to do that, you need to know the distribution pattern of your key space and design a new hashing scheme.
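Hadoop’s default HashPartitioner boils down to a single expression, sketched here as a standalone method:

```java
// The default HashPartitioner assigns a pair to a reducer by hashing its key.
// The bitmask keeps the result non-negative even when hashCode() is negative.
public class HashPartitionSketch {
    public static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Every pair with the same key lands on the same reducer, which is what makes the per-key consolidation in the Reduce step possible.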
- Transport, or Shuffle. This step serializes the map output and sends it to the corresponding reducer nodes over HTTP. It’s quite mechanical in nature and mostly taken care of by the Hadoop framework itself.
- Sort. This step sorts the key/value pairs by key for the reducers, and helps merge the reduce inputs so that all values for one key arrive together.
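The sort-and-merge effect can be sketched in plain Java: order the shuffled pairs by key and gather each key’s values into one group (names are illustrative, not Hadoop’s):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Sort/merge sketch: a TreeMap keeps keys in sorted order, and all values
// for one key end up in a single list, ready for the reducer.
public class SortSketch {
    public static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> shuffled) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : shuffled) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Demo input: pairs arriving at one reducer without a combiner applied.
    public static SortedMap<String, List<Integer>> demo() {
        return group(List.of(
            new SimpleEntry<>("test", 1), new SimpleEntry<>("a", 1),
            new SimpleEntry<>("test", 1)));
    }
}
```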
- Reduce. This step receives output from many mappers and consolidates the values for each common intermediate key.
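For word count, the reduce logic is just a sum over each key’s values. Sketched in plain Java (a real Hadoop reducer would extend the framework’s Reducer class):

```java
import java.util.List;

// Word count reducer logic: for each key, sum the values collected from
// all mappers. Illustrative names; not Hadoop's real API.
public class WordCountReduceSketch {
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }
}
```

Given the key “test” with values [1, 1] from the sample, it emits <”test”, 2>.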
- OutputFormat. This is the last step of the processing. It writes the reduced results out to files named like “part-xxxxx” in the Hadoop file system.
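The “part-xxxxx” naming convention amounts to a zero-padded reducer number; a sketch (exact names vary by Hadoop version – newer APIs write files like “part-r-00000”):

```java
// Sketch of the output file naming convention: reducer n writes its
// results to "part-" plus a zero-padded task number.
public class OutputNameSketch {
    public static String partFileName(int reducerIndex) {
        return String.format("part-%05d", reducerIndex);
    }
}
```

So a job with four reducers leaves files part-00000 through part-00003 in its output directory.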
You may notice that Hadoop MapReduce processing is not that straightforward for most programmers, even less so for traditional database developers who tend to think about data processing in SQL terms. It’s not easy to get onto MapReduce and do processing like querying, filtering, etc. the way you would in SQL. To make that easier, you may want to use Hive.
Carter Shanklin, who is now a director of product management at Hortonworks and whom you’ll want to follow if you are interested in what’s going on in the Hadoop community, also pointed me to the Cascading framework. According to its Web site:
Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.
As I browsed its WordCount sample, I found it’s tuned more toward the data pipeline concept, which makes it easier to follow than raw MapReduce. It’s also more verbose than I expected. Maybe its value becomes more obvious with more complicated processing in the real world than in the simple WordCount sample.
BTW, as I am still learning Hadoop on and off, there could be misunderstandings in the details of the workflow. If you catch one, please feel free to comment. Let’s make it an interactive learning experience.