Java 8 New Features: Map Reduce Made Easy With Stream APIs

In an earlier article, I introduced the new Stream API in Java 8. With it, you can apply many different operations to a stream, including map-reduce style functions.

One of the most famous frameworks supporting map-reduce for large-scale data processing, a.k.a. Big Data, is Hadoop, which I introduced almost two years ago. For data processing, the Java 8 stream API can do much the same thing on a smaller scale. Here is a quick sample that counts the number of words in a list of strings. There are significant differences in how the two are implemented and in the cases where each should be used. Let’s discuss them after the sample.


import java.util.Arrays;
import java.util.List;

public class MapReduce {
  public static void main(String[] args) {
    List<String> al = Arrays.asList(
      "This sample is by Steve from, a leading ",
      "technical blog on virtualization, cloud computing, and ",
      "software architecture.");
    // Map each string to its word count, then reduce by summing the counts.
    int total = al.parallelStream().mapToInt(e -> e.split(" ").length).sum();
    System.out.println("Total words:" + total);
  }
}

As you can see, the above sample does much the same thing as the Hadoop sample, but in a single JVM. I use a list of strings to keep it simple; you can easily change it to read from a file or a database. From a programming perspective, the above code is much smaller and cleaner than the Hadoop version, but, unlike Hadoop, it does not scale beyond a single JVM.
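To illustrate the point about reading from a file, here is a minimal sketch of the same count with the map and reduce steps written out explicitly. The class and method names (FileWordCount, countWords, countWordsInFile) are hypothetical and not from the original sample, and the file path is assumed to come from the command line. Unlike the sample above, it splits on any run of whitespace and skips blank lines.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical class name, used for illustration only.
class FileWordCount {

  // Counts whitespace-separated words across all lines of the stream.
  static int countWords(Stream<String> lines) {
    return lines
        .map(String::trim)                        // drop leading/trailing blanks
        .filter(line -> !line.isEmpty())          // skip empty lines
        .map(line -> line.split("\\s+").length)   // map: line -> word count
        .reduce(0, Integer::sum);                 // reduce: add up the counts
  }

  // Reads the file lazily, line by line (UTF-8), and counts in parallel.
  static int countWordsInFile(Path file) throws IOException {
    try (Stream<String> lines = Files.lines(file)) {
      return countWords(lines.parallel());
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length > 0) {
      // Assumption: the first argument is the path of a text file.
      System.out.println("Total words:" + countWordsInFile(Paths.get(args[0])));
    } else {
      // No file given: fall back to a small in-memory sample.
      System.out.println("Total words:" + countWords(Stream.of("map reduce", "made easy")));
    }
  }
}
```

Keeping countWords independent of where the lines come from means the same method could be fed from a file, a database cursor wrapped as a stream, or an in-memory list.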

Logically, Hadoop MapReduce and the Java stream API can achieve similar goals, but they differ significantly in many ways. Hadoop includes many different components, such as HDFS, the MapReduce engine, Hive, and Pig, which makes it far broader than the Java stream API, a small part of the Java standard library as of version 8. In terms of scale, the Hadoop framework supports distributed processing across hundreds or thousands of machines (physical or virtual, but mostly physical) in a cluster, whereas the Java stream API works on a single machine only. As a result, Hadoop can handle much larger data sets than the Java 8 stream API can.
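The single-machine limit can be made concrete with a small sketch. A parallel stream spreads its work over threads inside one JVM (by default the common fork/join pool), not over other machines. The class and method names below are hypothetical, and the thread names printed will vary from run to run.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;

// Hypothetical class name, used for illustration only.
class SingleJvmParallelism {

  // Same word count as the sample, run either sequentially or in parallel.
  static int countWords(List<String> lines, boolean parallel) {
    return (parallel ? lines.parallelStream() : lines.stream())
        .mapToInt(e -> e.split(" ").length)
        .sum();
  }

  // Names of the local threads that processed at least one element.
  static Set<String> workerThreads(List<String> lines) {
    Set<String> names = ConcurrentHashMap.newKeySet();
    lines.parallelStream().forEach(e -> names.add(Thread.currentThread().getName()));
    return names;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("a b c", "d e", "f g h i");
    // Parallelism is bounded by this JVM's common pool, i.e. by local cores.
    System.out.println("Common pool parallelism: "
        + ForkJoinPool.commonPool().getParallelism());
    System.out.println("Sequential total: " + countWords(lines, false));
    System.out.println("Parallel total:   " + countWords(lines, true));
    System.out.println("Threads used:     " + workerThreads(lines));
  }
}
```

Both totals are the same; the only difference is how many local threads share the work.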

Despite these key differences, Hadoop and the Java 8 stream API are compatible and can be used together. Within each Hadoop node, you can probably leverage the Java 8 stream API for better performance and cleaner code.
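As a rough sketch of how the two could be combined, the per-record logic that runs on a node could be written as a plain Java helper built on streams. The helper below is hypothetical: it counts the words in one line of input, and the Hadoop wiring around it (the mapper class and how results are emitted) is deliberately left out because the article does not show it.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, used for illustration only; it has no Hadoop dependency.
class LineWordCounts {

  // Returns how often each word occurs in a single line of input.
  static Map<String, Long> countsForLine(String line) {
    return Arrays.stream(line.trim().split("\\s+"))
        .filter(word -> !word.isEmpty())          // an empty line yields no words
        .collect(Collectors.groupingBy(
            Function.identity(),                  // group by the word itself
            TreeMap::new,                         // sorted keys, for stable output
            Collectors.counting()));              // count occurrences per word
  }

  public static void main(String[] args) {
    // prints {be=2, not=1, or=1, to=2}
    System.out.println(countsForLine("to be or not to be"));
  }
}
```

A map task could call such a helper once per input record and emit each entry as a key-value pair, leaving the cross-node aggregation to Hadoop.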

This entry was posted in Big Data, Software Development.

One Comment

  1. Posted January 14, 2017 at 9:56 pm | Permalink

    Very good article. I’m going through some of these issues as well..



    Me: Steve Jin, VMware vExpert who authored the VMware VI and vSphere SDK by Prentice Hall, and created the de facto open source vSphere Java API while working at VMware engineering. Companies like Cisco, EMC, NetApp, HP, Dell, and VMware are among the users of the API and other tools I developed for their products, internal IT orchestration, and test automation.