Category Archives: Big Data

Java 8 New Features: Map Reduce Made Easy With Stream APIs

In my previous article, I introduced the new Stream API. With the new stream APIs, you can apply many different operations on a stream, including map-reduce style operations.

One of the most famous frameworks to support map-reduce for large-scale data processing, a.k.a. Big Data, is Hadoop, which I introduced almost two years ago here. Data-processing wise, the Java 8 Stream API can do pretty much the same. Here is a quick sample that shows how it counts the number of words in a string. There are significant differences in how the two are implemented and the cases in which they should be used; let's discuss them after the sample.
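The original sample is not reproduced in this excerpt; a minimal sketch of such a word count with the Stream API might look like this (the exact code in the original post may differ):

    import java.util.Arrays;
    import java.util.Map;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class StreamWordCount {
        public static void main(String[] args) {
            String text = "to be or not to be";

            // Map: split the string into words; reduce: group identical words and count them.
            Map<String, Long> counts = Arrays.stream(text.split("\\s+"))
                    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

            counts.forEach((word, count) -> System.out.println(word + ": " + count));
        }
    }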


What is Missing in the VMware and EMC’s Pivotal Initiative?

Last week VMware formally announced that it would form a virtual team with EMC to take on the cloud service and middleware market. There was a rumor about it the week before, which turned out to be mostly true. If you are in the IT industry nowadays, you simply cannot underestimate the power of rumors. I think most VMware and EMC employees may have heard the rumor before hearing it from their management teams. :-)


Big Data or Big Junk?

Two weeks ago I ran into a problem with my blog site. Somehow I could not post there to announce the GA of the open source ViJava API for vSphere 5.1. After searching and researching, I found out that the wp_commentmeta table was filled with so much extra data that it exceeded the 100MB per-database limit imposed by my service provider. While I was enjoying the Thanksgiving holiday, some spammers and their robots had worked diligently, posting thousands of spam comments on my site.


Hadoop MapReduce Data Flow

As its name suggests, Hadoop MapReduce includes Map and Reduce phases in its data processing flow. At its highest level, MapReduce follows the traditional wisdom of "divide and conquer" – dividing big data into small chunks that can be processed by commodity computers, and then pulling the results together.
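As an illustration (my own sketch, not code from the original post), here is a minimal word-count job showing both sides of that flow with the org.apache.hadoop.mapreduce API:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each mapper gets a chunk of the input and emits (word, 1) pairs.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: the framework groups the pairs by word, and each reducer sums the counts.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

The "pulling the results together" happens in the shuffle and sort step between the two phases, where all values for the same key are routed to the same reducer.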


Hadoop File System APIs

As mentioned in my previous post on Hadoop File System commands, the commands are built on top of the HDFS APIs. These APIs are defined in the org.apache.hadoop.fs package, which includes several interfaces and over 20 classes, enums, and exceptions (the exact number varies from release to release).

As always, it's best to start with sample code when learning new APIs. The following sample copies a file from the local file system to HDFS.
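The sample itself is not included in this excerpt; a minimal sketch of such a copy with the FileSystem API might look like the following (class name and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToHdfs {
        public static void main(String[] args) throws Exception {
            // Reads the cluster settings (including the default file system) from core-site.xml.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder source (local) and destination (HDFS) paths.
            Path localFile = new Path("/tmp/sample.txt");
            Path hdfsFile = new Path("/user/demo/sample.txt");

            // Copies the local file into HDFS and leaves the local copy in place.
            fs.copyFromLocalFile(localFile, hdfsFile);
            fs.close();
        }
    }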


Hadoop File System Commands

I just took a Hadoop developer training during the week of September 10. To me, Hadoop is not totally new, as I've tried the HelloWorld sample and the Serengeti project. Still, I found it nice to get away from the daily job and go through a series of lectures and hands-on labs in a training setting. Believe it or not, I felt more tired after training than after a typical working day. There is not much new in this post; it's mainly to help me recall the commands when I need them later.


My First Try of Hadoop Azure

During the breaks of my vacation last week, I tried the Technology Preview of the Apache Hadoop-based Service on Windows Azure. The service is not yet publicly available and requires Microsoft approval. Here is the link that I used to file my application. It took several days for me to get the email with the invitation code. Sorry that I cannot include the code here. :-)


Big Data: How Big is Big?

I came across a video on YouTube over the past weekend: Big Ideas: How Big is Big Data. Although it comes with several mentions of EMC, it's very well prepared and demonstrated with white-boarding, and therefore worthwhile to share here.

Some of the key points made in the video include:

  • The growth is accelerating. By 2020, there will be 50x more data than today.

Hadoop vs. Tomcat

In my previous article, I talked about three different ways enterprises use Hadoop. If you think about it a bit more, you may realize that the three usage patterns are very similar to how we use Tomcat. I will compare the two for commonalities and differences.

First of all, both Hadoop and Tomcat are Java-based open source projects from the Apache Foundation, and are therefore covered by the same Apache license. As a result, you can freely use Hadoop in the same way as you have used Tomcat in terms of license compliance.


VMware Serengeti: A Perfect Match of Hadoop and vSphere

During the Hadoop Summit 2012 last month, I learned of the release of the open source (Apache license) Serengeti project from VMware. The week after, I downloaded the OVA file from the VMware site and gave it a first try in a development environment, after browsing through the user guide, which describes a fairly easy process for getting a Hadoop cluster running on vSphere.


Three Ways Enterprises Can Use Hadoop

Hadoop has recently gained a lot of attention from enterprises. Just think about the rapid growth in attendees at the Hadoop Summit. There are many different ways to leverage Hadoop in enterprises, but in general there are three major types of usage patterns, as detailed below.

As a Framework

This is what Hadoop was initially intended to be, and it continues to be one of the major approaches in the short term. It means that an enterprise needs to invest in customized application development, which normally costs more than off-the-shelf applications.


What Hadoop Community Can Learn From VMware Virtualization

As I mentioned in a previous article, Hadoop is at a similar stage to where virtualization was 10 years ago – the technology is mostly ready for wider adoption. There were certain secret sauces behind virtualization's stellar success, especially VMware's in the enterprise space. Here I examine some of these success factors that the Hadoop community could learn from.

Strive For Out Of Box Experience


Is MapReduce A Major Step Backwards?

While learning Hadoop, I was wondering whether the MapReduce processing model can handle all the Big Data challenges. David DeWitt and Michael Stonebraker took it a step further by arguing that MapReduce is a major step backwards in their blog article. I found it a very good read, though I don't necessarily agree with the authors. It's always good to know different opinions and the contexts they come from. I also found that the authors wrote one of the best introductions to MapReduce in just a few short paragraphs. I quote them at the end, so read on.


MapReduce: The Theory Behind Hadoop

As most of us know, Hadoop is a Java implementation of the MapReduce processing model, which originated at Google with Jeffrey Dean and Sanjay Ghemawat. After studying Hadoop and attending several related events (Hadoop Summit, Hadoop for Enterprise by the Churchill Club), I felt I should dig deeper by reading the original paper.

The paper is titled "MapReduce: Simplified Data Processing on Large Clusters." Unlike most research papers I've read before, it's written in plain English and is fairly easy to read and follow. I find it really worth reading and strongly recommend you spend an hour to read through it.


GPU for Big Data Processing

When talking about data processing, we naturally take the CPU for granted. However, the latest GPUs (Graphics Processing Units, also known as Visual Processing Units, or VPUs) come with hundreds of cores and can calculate much faster than CPUs. The question is how practical it is to use GPUs for processing big data.


GUI Front End for Hadoop

I went to LinkedIn last Wednesday for a tech talk by UC Berkeley professor Joseph Hellerstein on Programming for Distributed Consistency: CALM and Bloom. This is indeed a highly specialized topic, so I am not going to talk about the details. Should you be interested in the new programming language Bloom, you can check the web site (http://bloom-lang.org).


Hadoop Summit 2012: A Quick Summary

After the Churchill Club event on Hadoop for enterprises, I attended the Hadoop Summit at the San Jose Convention Center. It's one of the benefits of living in Silicon Valley that I can attend various tech events without flying away from my family for days.


Getting started with Hadoop: My First Try

Given the growing popularity of Hadoop, I decided to give it a try myself. As usual, I searched for a tutorial first and found one by Yahoo, which is based on a Hadoop 0.18.0 virtual machine. I knew the current stable version is 1.x, but that was OK because I just wanted to get the big picture, and I didn't want to pass up the convenience of a ready-to-use Hadoop virtual machine.


Hadoop For Enterprises: Event By Churchill Club

This past week was a busy one for the Hadoop community, with two Hadoop events in Silicon Valley. The first was "What Role Will Hadoop Play in the Enterprise" by the Churchill Club, which attracted about 300 attendees at a Palo Alto hotel. The second was the much bigger Hadoop Summit conference at the San Jose Convention Center. I will write a separate article on the second event soon.
