I took a Hadoop developer training course the week of September 10. Hadoop is not totally new to me, as I had already tried the HelloWorld sample and the Serengeti project. Still, I found it nice to get away from the daily job and go through a series of lectures and hands-on labs in a training setting. Believe it or not, I felt more tired after the training than after a typical working day. There is not much new in this post; it mainly serves as a reference for when I need the commands later.
The Hadoop Distributed File System (HDFS) is a fundamental building block of the Hadoop ecosystem. It is a file system designed to store big data, including both input data and result data. To do that, HDFS distributes big files across networked data nodes. Although logically contiguous, a big file can be split into many chunks, each of which can be saved on a different physical machine.
You can access the files with APIs, but more often with the command line (which is, BTW, an application built on top of the HDFS APIs). There are about 30 commands to manage a Hadoop file system remotely, for example from a Linux shell. Don’t confuse the Hadoop file system with your local file system. In a way, you can think of the Hadoop file system as a file system on another machine.
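To make the chunking concrete, here is a minimal sketch that estimates how many chunks (blocks) a file of a given size would occupy. The function name `blocks_needed` is my own, and the 64 MB default block size is an assumption; the real value depends on how your cluster is configured.

```shell
#!/bin/sh
# blocks_needed FILE_SIZE_BYTES [BLOCK_SIZE_BYTES]
# Hypothetical helper: estimates the number of blocks a file occupies.
# The 64 MB default is an assumption, not a value read from any cluster.
blocks_needed() {
    size=$1
    block=${2:-67108864}   # 64 * 1024 * 1024 bytes
    # Ceiling division: a partially filled last block still counts as one.
    echo $(( (size + block - 1) / block ))
}
```

For example, `blocks_needed 1073741824` (a 1 GB file) prints 16, and one extra byte pushes it to 17. Each of those blocks may live on a different data node.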
The basic syntax of HDFS commands is as follows:
$ hadoop fs -command [extra arguments]
$ hadoop fs -ls
The first part, “hadoop fs”, is always the same for file system related commands. What follows is very much like typical Unix/Linux commands in syntax. Besides managing HDFS itself, there are commands to import data files from the local file system to HDFS, and to export data files from HDFS to the local file system. These commands are unique to HDFS and therefore deserve the most attention.
[-put ... ]
[-copyFromLocal ... ]
[-moveFromLocal ... ]
[-get [-ignoreCrc] [-crc] ]
[-getmerge [addnl]]
[-copyToLocal [-ignoreCrc] [-crc] ]
[-moveToLocal [-crc] ]
A Typical Use Case
When using Hadoop, you need to move your data into HDFS before processing it, and optionally move the result back to your local file system. Here is a typical flow:
$ hadoop fs -mkdir test
$ hadoop fs -put input.txt test/input.txt
$ hadoop fs -ls test
$ hadoop fs -cat test/input.txt
$ hadoop jar mr.jar WordCount test/input.txt test/output
$ hadoop fs -ls test/output
$ hadoop fs -lsr test
$ hadoop fs -get test/output .
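If you repeat this flow often, it can be wrapped in a small script. The sketch below is one way to do it, not a recommended tool: the names `run` and `wordcount_flow` and the `DRY_RUN` switch are my own inventions, while `mr.jar` and `WordCount` are taken from the flow above. By default it only prints the commands, so you can check them before touching the cluster.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# wordcount_flow LOCAL_INPUT HDFS_DIR
# Uploads the input, runs the job, and copies the output directory back.
# Stops at the first command that fails.
wordcount_flow() {
    input=$1
    dir=$2
    name=$(basename "$input")
    run hadoop fs -mkdir "$dir" &&
    run hadoop fs -put "$input" "$dir/$name" &&
    run hadoop jar mr.jar WordCount "$dir/$name" "$dir/output" &&
    run hadoop fs -get "$dir/output" .
}
```

Calling `wordcount_flow input.txt test` prints the four commands; setting `DRY_RUN=0` before the call executes them against the cluster instead.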
Other Useful Commands
There are other commands you will find useful, for example the commands listed below:
$ hadoop fs -chmod 777 test/input.txt
$ hadoop fs -cp test/input.txt test/input1.txt
$ hadoop fs -rmr test
Space used in bytes for individual files or directories:
$ hadoop fs -du
Space used in bytes as a summary, so only one entry is given:
$ hadoop fs -dus
$ hadoop fs -count /test
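Since these commands report raw byte counts, large numbers can be hard to read at a glance. Here is a small helper I would sketch for converting a byte count into a friendlier unit. The name `human_bytes` is hypothetical and it is not part of Hadoop; it only formats a number you pass in.

```shell
#!/bin/sh
# human_bytes N
# Hypothetical helper: prints N bytes using binary units (1 KB = 1024 bytes).
human_bytes() {
    LC_ALL=C awk -v b="$1" 'BEGIN {
        split("B KB MB GB TB", unit, " ")
        i = 1
        # Divide by 1024 until the value fits the current unit.
        while (b >= 1024 && i < 5) { b /= 1024; i++ }
        printf "%.1f %s\n", b, unit[i]
    }'
}
```

For example, `human_bytes 1536` prints `1.5 KB` and `human_bytes 1073741824` prints `1.0 GB`.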
Last but not least is the help command. When in doubt, you can always use help:
$ hadoop fs -help
Don’t forget the “-” before help, or you will see something similar but different. You can also name the specific command you want help on, for example:
$ hadoop fs -help copyFromLocal