Getting Started with Hadoop: My First Try
Given the growing popularity of Hadoop, I decided to try it myself. As usual, I first searched for a tutorial and found one by Yahoo, which is based on a Hadoop 0.18.0 virtual machine. I knew the current stable version was 1.x, but that was OK: I just wanted the big picture, and I didn’t want to pass up the convenience of a ready-to-use Hadoop virtual machine.
The tutorial is not that long, so I simply walked through it. I already had Java and Eclipse set up, so I just downloaded the Hadoop virtual machine and ran it in VMware Player. Then I got stuck: the Eclipse plug-in required by the tutorial could not be found, because I didn’t have the CD mentioned in the tutorial. It took me a while, but I found a newer version of the plug-in.
After installing the plug-in, I could add a new Hadoop location in the Map/Reduce Locations view. The Hadoop location also showed up in the Eclipse Project Explorer view under the DFS Locations root node, but when I expanded it I got an error node, “Error: null.” Later on I found out that the command line can do most of the work.
Then came the WordCount sample code, which was the fun part for me. Before that, I copied the hadoop-0.18.0 directory (under the hadoop-user home directory on the virtual machine) to the machine where my Eclipse runs. I then created a new project using the MapReduce project wizard (which comes with the Hadoop plug-in) and specified the Hadoop library location there. The Hadoop plug-in simply adds all the required libraries (jar files) to the Java build path, so you don’t need to worry about them. If you don’t have the Hadoop plug-in installed, you can add them manually; the most important one is hadoop-0.18.0-core.jar.
After the project was created, I typed in the source code from the tutorial. Somehow it didn’t compile right away, so I had to search around, and I found similar code in the Cloudera Hadoop tutorial.
With a few tweaks, the application compiled. The following are the three Java files:
// WordCount.java: the driver that configures and submits the job
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount
{
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the (key, value) pairs the job emits: (word, count)
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The reducer doubles as the combiner
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        conf.setCombinerClass(WordCountReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Relative paths resolve against the user's HDFS home directory
        FileInputFormat.setInputPaths(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        // Submit the job and wait for it to finish
        JobClient.runJob(conf);
    }
}
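The driver registers WordCountReducer as the combiner as well as the reducer. As far as I understand it, that is safe here because summing is associative and commutative: adding up partial sums gives the same total as adding up the raw ones. Here is a minimal plain-Java sketch of that idea, with no Hadoop involved. The class name CombinerSketch and the two "map tasks" are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Hadoop or the tutorial.
public class CombinerSketch {

    // The same summing logic plays both roles: combiner and reducer.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Values emitted for one word by two imaginary map tasks.
        List<Integer> task1 = Arrays.asList(1, 1, 1);
        List<Integer> task2 = Arrays.asList(1, 1);

        // Without a combiner, the reducer sees all five raw values.
        int withoutCombiner = sum(Arrays.asList(1, 1, 1, 1, 1));

        // With a combiner, each task pre-sums locally and the reducer sees two values.
        int withCombiner = sum(Arrays.asList(sum(task1), sum(task2)));

        System.out.println(withoutCombiner + " " + withCombiner); // prints "5 5"
    }
}
```

A reducer that computed something like an average could not be reused as a combiner this way, because an average of averages is not the overall average.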

// WordCountMapper.java: emits (word, 1) for every token in a line
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        // Lower-case the line so that "The" and "the" count as the same word
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens())
        {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}

// WordCountReducer.java: sums the counts collected for each word
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            // The iterator is already typed, so no cast is needed
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
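
To see what the mapper and reducer compute without starting a cluster, the same data flow can be replayed in memory. This is only a rough sketch under my own assumptions: LocalWordCount and its methods are hypothetical names, and a TreeMap stands in for the grouping and sorting by key that the framework does between the map and reduce phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: replays the WordCount data flow in memory.
public class LocalWordCount {

    // Map step: emit one (word, 1) pair per lower-cased token, as WordCountMapper does.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Reduce step: sum all values collected for one key, as WordCountReducer does.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        // Stand-in for the shuffle: group the emitted values by key, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Call reduce once per distinct key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Hello Hadoop", "hello world")));
        // prints {hadoop=1, hello=2, world=1}
    }
}
```

Because the mapper lower-cases each line, "Hello" and "hello" end up under the same key.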
I then packaged it as wordcount.jar and sent it to the Hadoop virtual machine. Finally, I created a new directory in HDFS and copied a text file into it so that the sample could read it and count the words.
The following are a few commands I used in the virtual machine:
hadoop-user@hadoop-desk:~ $ ./init-hdfs
hadoop-user@hadoop-desk:~ $ ./start-hadoop
hadoop-user@hadoop-desk:~ $ hadoop fs -mkdir input
hadoop-user@hadoop-desk:~ $ hadoop fs -put ../foo.txt /user/hadoop-user/input
hadoop-user@hadoop-desk:~ $ hadoop fs -ls input/
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop jar wordcount.jar WordCount
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -ls output/
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -get output/part-00000
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -rmr /user/hadoop-user/output
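
Since the job uses TextOutputFormat, each line of part-00000 should be a word, a tab, and its count (tab is the default separator). Below is a small sketch for listing the most frequent words from such a file. TopWords is a hypothetical class of my own, and the built-in sample lines are made-up data used only when no file is given.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: ranks words from WordCount output lines.
public class TopWords {

    // Parse "word<TAB>count" lines and return the n entries with the highest counts.
    static List<Map.Entry<String, Integer>> top(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : lines) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines that are not key<TAB>value
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            entries.add(new SimpleEntry<String, Integer>(word, count));
        }
        // Highest count first; the sort is stable, so ties keep their file order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) throws IOException {
        // Read the file named on the command line, or fall back to made-up sample data.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("hadoop\t1", "hello\t2", "world\t1");
        for (Map.Entry<String, Integer> e : top(lines, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

After fetching the result with hadoop fs -get, it could be run as java TopWords part-00000.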
After trying the WordCount sample and reading through two tutorials, I got a good understanding of MapReduce and Hadoop at a very high level. To get some real work done, I think I need to study more. That is why I ordered the book Hadoop: The Definitive Guide. I will write more after reading through the book in about a month. Stay tuned.

After two months, I finally solved my problem. I’m the happiest man in the world now!
I was using commons-logging-1.1.1, which doesn’t work with hadoop-0.18.0.
If you download hadoop-0.20.2, for example, and use the libraries that ship with it, it works.
Thank God.
I am trying to run the tutorial by Yahoo, which is based on the Hadoop 0.18.0 virtual machine. I am getting this error in Eclipse: “Call to /192.168.94.9000 fail on local exception: java.io.EOFException”. What might be missing in the configuration on the Eclipse side?
Hi… Thanks for the brief explanation of your experience. I face the same problem with the Eclipse configuration: I am getting the error “Error: null”. Could you please tell me how you managed to get the configuration working as given in the tutorial? Waiting for your valuable feedback.