
Java 8 New Features: Map Reduce Made Easy With Stream APIs

April 20th, 2014

In a previous article, I introduced the new Stream API. With it, you can apply many different operations to a stream, including map-reduce style functions.

One of the most famous frameworks supporting map-reduce for large-scale data processing, a.k.a. big data, is Hadoop, which I introduced almost two years ago. In terms of data processing logic, the Java 8 stream API can do pretty much the same. Here is a quick sample that counts the number of words in a list of strings. There are significant differences in how the two are implemented and in the cases where each should be used. Let’s discuss them after the sample.


package org.doublecloud.jave8demo.stream;
 
import java.util.Arrays;
import java.util.List;
 
public class MapReduce
{
  public static void main(String[] args)
  {
    // Sample input; each element plays the role of one input record.
    List<String> al = Arrays.asList(
      "This sample is by Steve from doublecloud.org, a leading ",
      "technical blog on virtualization, cloud computing, and ",
      "software architecture.");
 
    // Map: turn each string into its word count. Reduce: sum the counts.
    int total = al.parallelStream().mapToInt(e -> e.split(" ").length).sum();
    System.out.println("Total words:" + total);
  }
}

As you can see, the above sample does pretty much the same thing as the Hadoop sample, but in a single JVM. I use a list of strings to keep it simple. You can easily change it to read from a file or a database. From a programming perspective, the above code is much smaller and cleaner than the Hadoop version, but it does not scale beyond a single JVM either.
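To illustrate the point about reading from a file, here is a minimal sketch that feeds the same kind of pipeline from `Files.lines` instead of an in-memory list. The class name `FileWordCount`, the helper methods `countWords` and `countWordsInFile`, and the temporary sample file are assumptions made for this illustration; they are not part of the original sample.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical class for illustration only.
class FileWordCount
{
  // Counts whitespace-separated words across a stream of lines.
  static long countWords(Stream<String> lines)
  {
    return lines
        .map(String::trim)
        .filter(line -> !line.isEmpty())               // skip blank lines
        .mapToLong(line -> line.split("\\s+").length)  // map: words per line
        .sum();                                        // reduce: add them up
  }

  // Streams the file line by line; try-with-resources closes the file handle.
  static long countWordsInFile(Path file)
  {
    try (Stream<String> lines = Files.lines(file))
    {
      return countWords(lines.parallel());
    }
    catch (IOException e)
    {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws IOException
  {
    // Write a small sample file so the sketch is self-contained.
    Path file = Files.createTempFile("wordcount", ".txt");
    Files.write(file, Arrays.asList(
      "This sample reads its input",
      "from a file instead of a list."));
    System.out.println("Total words:" + countWordsInFile(file));
    Files.delete(file);
  }
}
```

Splitting on `\\s+` after trimming is a small change from the original `split(" ")`, chosen here so that repeated spaces and blank lines in a real file do not inflate the count.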

Logically, Hadoop MapReduce and the Java stream API can achieve similar goals, but they differ significantly in many ways. Hadoop includes many different components such as HDFS, the MapReduce engine, Hive, Pig, etc., which is far more than the Java stream API, a small part of the Java standard library as of version 8. In terms of scale, the Hadoop framework supports distributed processing across hundreds or thousands of machines (physical or virtual, but mostly physical) in a cluster, whereas the Java stream API works on a single machine only. As a result, Hadoop can handle much larger data sets than the Java 8 stream API can.
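To make the logical similarity more concrete, the sketch below counts occurrences per word, which is what the classic Hadoop WordCount example produces, rather than a single total. The `flatMap` step plays the role of the map phase and `groupingBy` with `counting` plays the role of the reduce phase. The class name `WordFrequency`, the method `countByWord`, and the sample sentences are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical class for illustration only.
class WordFrequency
{
  static Map<String, Long> countByWord(List<String> lines)
  {
    return lines.parallelStream()
        // "Map" phase: emit every lower-cased word of every line.
        .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
        .filter(word -> !word.isEmpty())
        // "Reduce" phase: group identical words and count each group.
        // A TreeMap keeps the result sorted by word.
        .collect(Collectors.groupingBy(
            Function.identity(), TreeMap::new, Collectors.counting()));
  }

  public static void main(String[] args)
  {
    List<String> lines = Arrays.asList(
      "the quick brown fox",
      "jumps over the lazy dog",
      "The end");
    System.out.println(countByWord(lines));
  }
}
```

Unlike Hadoop, all of the grouping here happens in the memory of one JVM, so this approach only suits data whose distinct words fit comfortably on the heap.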

Despite these key differences, Hadoop and the Java 8 stream API are actually compatible and can be used together. Within each Hadoop node, you can probably leverage the Java 8 stream API for better performance and cleaner code.
