Getting Started with Hadoop: My First Try
Given the growing popularity of Hadoop, I decided to give it a try myself. As usual, I searched for a tutorial first and found one from Yahoo, based on a Hadoop 0.18.0 virtual machine. I knew the current stable version was 1.x, but that was OK because I just wanted to get the big picture, and I didn’t want to pass up the convenience of a ready-to-use Hadoop virtual machine.
The tutorial is not that long, so I just tried to walk through it. Because I already had Java and Eclipse set up, I simply downloaded the Hadoop virtual machine and ran it on VMware Player. Then I got stuck: the Eclipse plug-in required by the tutorial could not be found, since I didn’t have the CD mentioned in the tutorial. It took me a while, but I found a newer version of the plug-in.
After installing the plug-in, I could add a new Hadoop location in the Map/Reduce Locations view. The Hadoop location also showed up in the Eclipse Project Explorer view under the DFS Locations root node, but when I expanded it I got the error node “Error: null.” Later on I found out that the command line can do most of the work.
Then came the WordCount sample code, which was the fun part for me. Before that, I copied the hadoop-0.18.0 directory under the hadoop-user home directory to the machine where my Eclipse runs. I then created a new project using the MapReduce project wizard (which comes with the Hadoop plug-in) and specified the Hadoop library location there. The Hadoop plug-in simply adds all the required libraries (jar files) to the Java build path so you don’t need to worry about them. If you don’t have the Hadoop plug-in installed, you can add them manually; the most important one is hadoop-0.18.0-core.jar.
After the project was created, I typed in the source code from the tutorial. Somehow it didn’t compile right away, so I had to search around, and I found similar code in the Cloudera Hadoop tutorial.
With a few tweaks, the application compiled. The following are the three Java files:
// WordCount.java – the job driver: configures and submits the MapReduce job.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Output types of the job: word -> count.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);
        // The reducer doubles as a combiner to pre-aggregate map output.
        conf.setCombinerClass(WordCountReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Relative HDFS paths, resolved under the user's home directory.
        FileInputFormat.setInputPaths(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));

        JobClient.runJob(conf);
    }
}
// WordCountMapper.java – emits (word, 1) for every token in each input line.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
// WordCountReducer.java – sums the counts collected for each word.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
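For reference, if you compile outside Eclipse, a minimal command-line sketch might look like the following. The jar path is an assumption based on where I copied the hadoop-0.18.0 directory; adjust it to your layout, and note (as a commenter below points out) that at runtime the matching commons-logging jar from Hadoop’s lib directory may also be needed:

# a sketch, assuming the three .java files are in the current directory
javac -classpath hadoop-0.18.0/hadoop-0.18.0-core.jar WordCount.java WordCountMapper.java WordCountReducer.java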
I then jarred it up as wordcount.jar and sent it to the Hadoop virtual machine. Finally, I created a new directory in HDFS and copied a text file into it so that the sample had something to read when counting words.
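The packaging and transfer step was roughly the following. This is a sketch, not a recipe: it assumes the compiled .class files sit in the current directory, and the target path on the VM is an assumption based on my setup (the hadoop-user account and hadoop-desk hostname come from the tutorial VM):

# package the compiled classes and copy the jar to the VM
jar -cvf wordcount.jar *.class
scp wordcount.jar hadoop-user@hadoop-desk:~/hadoop-0.18.0/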
The following are a few commands I used in the virtual machine:
hadoop-user@hadoop-desk:~$ ./init-hdfs
hadoop-user@hadoop-desk:~$ ./start-hadoop
hadoop-user@hadoop-desk:~$ hadoop fs -mkdir input
hadoop-user@hadoop-desk:~$ hadoop fs -put ../foo.txt /user/hadoop-user/input
hadoop-user@hadoop-desk:~$ hadoop fs -ls input/
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop jar wordcount.jar WordCount
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -ls output/
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -get output/part-00000
hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -rmr /user/hadoop-user/output
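If you just want to peek at the result without pulling it out of HDFS first, you can cat the output file directly. Because the job uses TextOutputFormat, each line of part-00000 is a word and its count separated by a tab:

hadoop-user@hadoop-desk:~/hadoop-0.18.0$ hadoop fs -cat output/part-00000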
After trying the WordCount sample and reading through the two tutorials, I got a good understanding of MapReduce and Hadoop at a very high level. To get some real work done, I think I need to study more. That is why I ordered the book Hadoop: The Definitive Guide. I will write more after reading through the book, in about one month. Stay tuned.
After two months, I was able to solve my problem. I’m the happiest man in the world now!
I was using commons-logging-1.1.1; it doesn’t work for hadoop-0.18.0.
If you download hadoop-0.20.2, for example, and use its lib, it works.
Thank God 😀
I am trying to run the tutorial by Yahoo, which is based on the Hadoop 0.18.0 virtual machine. I am getting an error in Eclipse – “Call to /192.168.94.9000 failed on local exception: java.io.EOFException.” What might be missing in the configuration on the Eclipse side?
Hi… Thanks for the brief explanation of your experience. I faced the same problem in the Eclipse configuration; I am getting the error “Error: null.” Could you please tell me how you managed to get the configuration as given in the tutorial? Waiting for your valuable feedback.
Hi,
I am also getting the same error in Eclipse – “Call to /192.168.94.9000 failed on local exception: java.io.EOFException.”
Please guide us to resolve this issue.
The “Error: null” problem is resolved here:
http://stackoverflow.com/questions/19108060/not-able-to-run-hadoop-from-eclipse-saying-hadoop-location-in-eclipseerrornull/20737277