Hadoop MapReduce Data Flow

As its name suggests, Hadoop MapReduce includes Map and Reduce steps in its data-processing flow. At the highest level, MapReduce follows the traditional wisdom of “Divide and Conquer”: dividing big data into small chunks that a commodity computer can process, and then pulling the results back together.

If we look closely at the detailed workflow of how big data is processed in Hadoop, we’ll find many more stages. This article explains the steps involved in the Hadoop MapReduce data flow. For most use cases, developers only need to write customized code for the map and reduce steps and let the Hadoop framework take care of the rest.
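Here is a minimal sketch of what that customized code might look like for the classic word count sample, written against the org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are my own for illustration, not from any official sample:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: gets <byte offset of the line, line text>, e.g. <10, "test is a test">,
    // and emits <word, 1> for every word on the line.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE); // e.g. <"test", 1>
          }
        }
      }
    }

    // Reducer: gets <word, [1, 1, ...]> and sums the counts into <word, total>.
    // (Package-private only so both classes fit in one listing; normally each
    // class lives in its own source file.)
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(word, new IntWritable(sum));
      }
    }

The framework calls map() once per input pair and reduce() once per distinct key; the rest of the steps below are either optional to customize or handled automatically.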


  1. InputFormat. This first step reads big data in via the HDFS APIs and chops it into many key/value pairs. For different types of source data, Hadoop provides built-in implementations such as TextInputFormat (the default), KeyValueInputFormat, SequenceFileInputFormat, and DBInputFormat. As its name suggests, TextInputFormat targets text files: it chops a text file on a per-line basis, emitting pairs whose key is the byte offset of a line in the file and whose value is the text of that line.
  2. Map. This step takes the individual key/value pairs from the first step and processes them with user-supplied logic, generating new sets of key/value pairs. In the famous word count sample, the mapper gets in something like <10, “test is a test”> and maps it to <“test”, 1>, <“is”, 1>, <“a”, 1>, <“test”, 1>.
  3. Combine. This optional step is known as a map-side “mini-reducer.” It can reduce the amount of data to be shuffled across the wire by combining some of the pairs locally. Applied to the pairs from the sample in step 2, the two <“test”, 1> records can be combined into one <“test”, 2>.
    As it’s essentially a reduce that happens to run on the map side, the combiner has to implement the Reducer interface and must not change the keys. It may be invoked multiple times, or not at all.
  4. Partition. This step allocates the intermediate data across the number of reducers (partitions) that developers specify. By default, the hash value of the key determines which reducer a pair goes to. If you use Java built-in data types as keys, the hashed key values should be evenly distributed. But if your key values are unevenly distributed, you may end up with an unbalanced workload across reducers. In that case you have to write a custom partitioner to balance the workload, which means understanding the distribution of your key space and designing a new partitioning scheme (see the driver sketch after this list).
  5. Transport, or Shuffle. This step serializes the map output and sends it to the corresponding reducer nodes over HTTP. It is quite mechanical in nature and mostly taken care of by the Hadoop framework itself.
  6. Sort. This step sorts the key/value pairs by key and merges the inputs from different mappers on the reduce side, so that each reducer sees all the values for a given key together.
  7. Reduce. It receives output from many mappers, and consolidates values for common intermediate keys.
  8. OutputFormat. This is the last step of the processing. It writes the reduced results out to files named like “part-xxxxx” in the Hadoop Distributed File System (HDFS).
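To show where these eight steps get wired together, below is a hedged driver sketch using the same org.apache.hadoop.mapreduce API. It reuses the WordCountMapper and WordCountReducer classes sketched earlier and adds a hypothetical FirstLetterPartitioner purely to illustrate step 4; none of these names come from an official Hadoop sample.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCountDriver {

      // Step 4: a hypothetical custom partitioner. Instead of the default hash
      // partitioning, it routes words to reducers by their first letter, which
      // only balances well if words are spread evenly across the alphabet.
      public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text word, IntWritable count, int numPartitions) {
          int first = Character.toLowerCase(word.toString().charAt(0));
          return first % numPartitions;
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Step 1: InputFormat. TextInputFormat is the default, set explicitly here.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Step 2: Map.
        job.setMapperClass(WordCountMapper.class);

        // Step 3: Combine. The reducer doubles as the combiner here, because
        // summing counts is associative and the keys are left unchanged.
        job.setCombinerClass(WordCountReducer.class);

        // Step 4: Partition.
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setNumReduceTasks(4);

        // Steps 5 and 6 (shuffle and sort) are handled by the framework.

        // Step 7: Reduce.
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Step 8: OutputFormat. Writes part-xxxxx files to the output path in HDFS.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

With this wiring in place, only the mapper, reducer, and (optionally) combiner and partitioner contain application logic; the rest of the data flow is driven by the framework.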

You may notice that Hadoop MapReduce processing is not that straightforward for most programmers, and even less so for traditional database developers who tend to think about data processing in SQL style. It’s not easy to pick up MapReduce and do processing like querying and filtering the way you would in SQL. To make that easier, you may want to use Hive.
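As a rough illustration, the whole word count flow above collapses into one SQL-style statement in Hive. The sketch below runs it over Hive’s JDBC driver against a HiveServer2 instance; the host, port, and the docs table with a single line column are assumptions made up for this example:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveWordCount {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and table name are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

          // Hive compiles this query into MapReduce jobs behind the scenes.
          ResultSet rs = stmt.executeQuery(
              "SELECT word, count(1) AS cnt "
            + "FROM (SELECT explode(split(line, '\\\\s+')) AS word FROM docs) w "
            + "GROUP BY word");

          while (rs.next()) {
            System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
          }
        }
      }
    }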

Carter Shanklin, who is now a director of product management at HortonWorks and whom you want to follow if you are interested in what’s going on in the Hadoop community, also pointed me to the Cascading framework. According to its Web site:

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.

As I browsed its WordCount sample, I found it is more tuned to the data-pipeline concept, which makes it easier to reason about than raw MapReduce. It’s also more verbose than I expected. Maybe its value becomes more obvious with more complicated processing in the real world than with the simple WordCount sample.

BTW, as I am still learning Hadoop on and off, there could be misunderstandings in the details of the workflow. If you catch one, please feel free to comment. Let’s make it an interactive learning experience.




