
Hadoop File System APIs

October 1st, 2012

As mentioned in my previous post on Hadoop File System commands, those commands are built on top of the HDFS APIs. These APIs are defined in the org.apache.hadoop.fs package, which includes several interfaces and over 20 classes, enums, and exceptions (the exact number varies from release to release).

As always, it’s best to start with sample code when learning new APIs. The following sample copies a file from the local file system to HDFS.


import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import java.io.*;

public class CopyFileToHDFS {
  public static void main(String[] args) {
    String inputPath = args[0];
    String outputPath = args[1];
    try {
      CopyFileToHDFS copier = new CopyFileToHDFS();
      copier.copyToHDFS(new Path(inputPath), new Path(outputPath));
    } catch (Exception e) {
      e.printStackTrace(); // at least report the failure
    }
  }

  public void copyToHDFS(Path inPath, Path outPath) throws IOException {
    Configuration config = new Configuration();
    FileSystem hdfs = FileSystem.get(config);
    LocalFileSystem local = FileSystem.getLocal(config);
    FSDataInputStream inStream = local.open(inPath);
    FSDataOutputStream outStream = hdfs.create(outPath);
    byte[] buf = new byte[1024];
    int bytesRead;
    while ((bytesRead = inStream.read(buf)) > 0) {
      outStream.write(buf, 0, bytesRead);
    }
    inStream.close();
    outStream.close();
  }
}

The key part of the sample is the copyToHDFS() method. It first creates a Configuration object, from which both the Hadoop file system and the local file system are obtained via the static factory methods get() and getLocal(). These three calls look much the same in most applications using the APIs; what differs is the code that comes after you get hold of the file systems. If you don’t need to work with both file systems, you can skip one.

Let’s get back to the sample. From the two file systems, an FSDataInputStream and an FSDataOutputStream are opened for the given file paths. From that point on, copying from one stream to another is ordinary Java I/O.
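To underline that last point, here is a minimal, Hadoop-free sketch of the same copy loop using plain java.io streams. The in-memory streams are just stand-ins for the real file streams; the buffer and read/write calls mirror the sample above.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
  // Copy everything from in to out, 1024 bytes at a time.
  // The same loop works for FSDataInputStream/FSDataOutputStream,
  // since both extend the java.io stream classes.
  static void copy(InputStream in, OutputStream out) throws IOException {
    byte[] buf = new byte[1024];
    int bytesRead;
    while ((bytesRead = in.read(buf)) > 0) {
      out.write(buf, 0, bytesRead);
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream("hello hdfs".getBytes());
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    copy(in, out);
    System.out.println(out.toString()); // prints "hello hdfs"
  }
}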

In addition to the HDFS CLIs, another common use case for the HDFS APIs is the InputFormat, the first step of MapReduce processing, which reads a data file and splits it into key/value pairs. I’ll cover the various stages of the whole data processing pipeline in a separate post.
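To give a feel for what an InputFormat produces: Hadoop’s default TextInputFormat emits each line’s byte offset as the key and the line’s text as the value. The following sketch mimics that key/value splitting in plain Java on an in-memory string; it is an illustration only, not actual Hadoop code.

import java.util.LinkedHashMap;
import java.util.Map;

public class LineSplitter {
  // Mimics TextInputFormat: key = byte offset of the line,
  // value = the line's text (newline excluded).
  static Map<Long, String> toKeyValues(String data) {
    Map<Long, String> pairs = new LinkedHashMap<>();
    long offset = 0;
    for (String line : data.split("\n", -1)) {
      pairs.put(offset, line);
      offset += line.getBytes().length + 1; // +1 for the newline
    }
    return pairs;
  }

  public static void main(String[] args) {
    Map<Long, String> pairs = toKeyValues("foo\nbar\nbaz");
    for (Map.Entry<Long, String> e : pairs.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
    // 0 -> foo, 4 -> bar, 8 -> baz
  }
}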

To understand all the APIs better, you can browse the HDFS API reference. Don’t spend too much time on it, though, unless you’ll actually be using the APIs soon.

As a last note, I feel the way the configuration and file system are associated and coded is not straightforward, especially the configuration creation. It could have been more explicit with a factory pattern similar to Runtime.getRuntime() or Desktop.getDesktop() in the Java system library; that is, it could be Configuration.getConfiguration().
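To make the suggestion concrete, here is a sketch of that factory style, modeled on Runtime.getRuntime(). The class AppConfig and its getConfiguration() method are hypothetical names for illustration; Hadoop’s real Configuration is created with new Configuration().

// Hypothetical sketch of a Runtime.getRuntime()-style factory
// (not part of Hadoop's actual API).
public class AppConfig {
  private static final AppConfig INSTANCE = new AppConfig();

  private AppConfig() {} // no public constructor, just like Runtime

  // The explicit factory method suggested above:
  public static AppConfig getConfiguration() {
    return INSTANCE;
  }
}

Every caller would then share one well-known instance via AppConfig.getConfiguration(), rather than each constructing its own.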

Categories: Big Data