
Getting started with Hadoop: My First Try

Given the growing popularity of Hadoop, I decided to give it a try myself. As usual, I searched for a tutorial first and found one by Yahoo, which is based on a Hadoop 0.18.0 virtual machine. I knew the current stable version was 1.x, but that was OK because I just wanted to get the big picture, and I didn’t want to pass up the convenience of a ready-to-use Hadoop virtual machine.

The tutorial is not that long, so I just tried to walk through it. Because I already had Java and Eclipse set up, I simply downloaded the Hadoop virtual machine and ran it on VMware Player. Then I got stuck: the Eclipse plug-in required by the tutorial could not be found, since I didn’t have the CD mentioned in the tutorial. It took me a while, but I found a newer version of the plug-in.


After installing the plug-in, I could add a new Hadoop location in the Map/Reduce Locations view. The Hadoop location also showed up in the Eclipse Project Explorer view under the DFS Locations root node, but when I expanded it I got the error node “Error: null.” Later on I found out that the command line can do most of the work anyway.
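
For example, instead of browsing HDFS through the plug-in’s DFS Locations tree, the same information is available from the shell inside the virtual machine. A quick sketch of the kind of commands I mean (the paths are just illustrative):

hadoop-user@hadoop-desk:~$ hadoop fs -ls /
hadoop-user@hadoop-desk:~$ hadoop fs -lsr /user/hadoop-user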

Then came the WordCount sample code, which was the fun part for me. Before that, I copied the hadoop-0.18.0 directory under the hadoop-user home directory to the machine where my Eclipse runs. I then created a new project using the MapReduce project wizard (which comes with the Hadoop plug-in) and specified the Hadoop library location there. The Hadoop plug-in simply adds all the required libraries (jar files) to the Java build path so you don’t need to worry about them. If you don’t have the Hadoop plug-in installed, you can add them manually; the most important one is hadoop-0.18.0-core.jar.
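
If you compile outside Eclipse, the manual route is roughly the following (a sketch; it assumes the three source files and the copied hadoop-0.18.0 directory both sit in the current directory):

mkdir classes
javac -classpath hadoop-0.18.0/hadoop-0.18.0-core.jar -d classes WordCount.java WordCountMapper.java WordCountReducer.java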

After the project was created, I typed in the source code from the tutorial. Somehow it didn’t compile right away, so I searched around and found similar code in the Cloudera Hadoop tutorial.

With a few tweaks, the application compiled. The following are the three Java files:

// WordCount.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;


public class WordCount 
{

	public static void main(String[] args) throws Exception
	{
		// Job configuration using the old (pre-0.20) mapred API
		JobConf conf = new JobConf(WordCount.class);
		conf.setJobName("wordcount");
	
		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(IntWritable.class);
		
		conf.setMapperClass(WordCountMapper.class);
		conf.setReducerClass(WordCountReducer.class);
		// The reducer also serves as a combiner to pre-aggregate map output
		conf.setCombinerClass(WordCountReducer.class);
		
		conf.setInputFormat(TextInputFormat.class);
		conf.setOutputFormat(TextOutputFormat.class);
		
		// Relative paths resolve under the user's home directory in HDFS
		FileInputFormat.setInputPaths(conf, new Path("input"));
		FileOutputFormat.setOutputPath(conf, new Path("output"));
	
		JobClient.runJob(conf);
	}
}

// WordCountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;


public class WordCountMapper extends MapReduceBase
	implements Mapper<LongWritable, Text, Text, IntWritable>
{
	// Reused across map() calls to avoid allocating new objects per record
	private final IntWritable one = new IntWritable(1);
	private Text word = new Text();
	
	public void map(LongWritable key, Text value,
			OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
	{
		// Split the line into lowercase tokens and emit (word, 1) for each
		String line = value.toString();
		StringTokenizer itr = new StringTokenizer(line.toLowerCase());
		while(itr.hasMoreTokens())
		{
			word.set(itr.nextToken());
			output.collect(word, one);
		}
	}
	
}

// WordCountReducer.java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;


public class WordCountReducer extends MapReduceBase 
	implements Reducer<Text, IntWritable, Text, IntWritable>
{
	public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
	{
		// Sum all the partial counts emitted for this word
		int sum = 0;
		while(values.hasNext())
		{
			sum += values.next().get();
		}
		output.collect(key, new IntWritable(sum));
	}
}

I then jarred it up as wordcount.jar and sent it to the Hadoop virtual machine. Finally, I created a new directory in HDFS and copied a text file into it for the sample to read and count words.
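
The packaging and transfer steps were roughly the following (a sketch; the classes directory and the host name are placeholders for my setup):

jar cf wordcount.jar -C classes .
scp wordcount.jar hadoop-user@hadoop-desk:~/hadoop-0.18.0/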

The following are a few commands I used in the virtual machine:

hadoop-user@hadoop-desk:~$ ./init-hdfs
hadoop-user@hadoop-desk:~$ ./start-hadoop
hadoop-user@hadoop-desk:~$ hadoop fs -mkdir input
hadoop-user@hadoop-desk:~$ hadoop fs -put ../foo.txt /user/hadoop-user/input
hadoop-user@hadoop-desk:~$ hadoop fs -ls input/
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop jar wordcount.jar WordCount
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -ls output/
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -get output/part-00000
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -rmr /user/hadoop-user/output
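
You can also inspect the result directly in HDFS instead of copying it out. TextOutputFormat writes tab-separated key/value pairs, one word per line, so the output looks something like this (the words and counts below are only illustrative):

hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -cat output/part-00000
hadoop	4
tutorial	2
wordcount	1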

After trying the WordCount sample and reading through the two tutorials, I have a good understanding of MapReduce and Hadoop at a very high level. To get some real work done, though, I think I need to study more. That is why I ordered the book Hadoop: The Definitive Guide. I will write more after reading through the book in about one month. Stay tuned.

  1. beginner1010
    July 3rd, 2012 at 16:24 | #1

    After two months, I could solve my problem. I’m the happiest man in the world now!
    I was using commons-logging-1.1.1, which doesn’t work with hadoop-0.18.0.
    If you download hadoop-0.20.2, for example, and use its lib directory, it works.

    Thank god 😀

  2. Vidya
    September 24th, 2012 at 18:59 | #2

    I am trying to run the tutorial by Yahoo, which is based on the Hadoop 0.18.0 virtual machine. I am getting an error in Eclipse: “Call to /192.168.94.9000 failed on local exception: java.io.EOFException”. What might be missing in the configuration on the Eclipse side?

  3. Shiva
    April 6th, 2013 at 16:59 | #3

    Hi… Thanks for the brief explanation of your experience. I face the same problem with the Eclipse configuration: I am getting the error “Error: null”. Could you please tell me how you managed to get the configuration working as given in the tutorial? Waiting for your valuable feedback.

  4. Gavaskar Rathnam
    June 27th, 2013 at 04:45 | #4

    Hi,

    I am also getting the same error in Eclipse: “Call to /192.168.94.9000 failed on local exception: java.io.EOFException”.

    Please guide us to resolve this issue.
