Hadoop File System APIs

As mentioned in my previous post on Hadoop File System commands, the commands are built on top of the HDFS APIs. These APIs are defined in the org.apache.hadoop.fs package, which includes several interfaces and over 20 classes, enums, and exceptions (the exact number varies from release to release).

As always, it’s best to start with sample code when learning new APIs. The following sample copies a file from the local file system to HDFS.


import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import java.io.*;

public class CopyFileToHDFS {
  public static void main(String[] args) {
    String inputPath = args[0];
    String outputPath = args[1];
    try {
      CopyFileToHDFS copier = new CopyFileToHDFS();
      copier.copyToHDFS(new Path(inputPath), new Path(outputPath));
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  public void copyToHDFS(Path inPath, Path outPath) throws IOException {
    Configuration config = new Configuration();
    FileSystem hdfs = FileSystem.get(config);
    LocalFileSystem local = FileSystem.getLocal(config);
    FSDataInputStream inStream = local.open(inPath);
    FSDataOutputStream outStream = hdfs.create(outPath);
    try {
      // copy in 1 KB chunks until end of the input stream
      byte[] buf = new byte[1024];
      int bytesRead;
      while ((bytesRead = inStream.read(buf)) > 0) {
        outStream.write(buf, 0, bytesRead);
      }
    } finally {
      inStream.close();
      outStream.close();
    }
  }
}

The key part of the sample is the copyToHDFS() method. It first creates a new Configuration object, from which both the local file system and the Hadoop file system are retrieved via the static factory methods FileSystem.get() and FileSystem.getLocal(). These three calls look much the same in most applications using the APIs; what differs is the work done after getting hold of the file systems. If you don’t need to work with both file systems, you can skip one.

Let’s get back to the sample. From the two file systems, an FSDataInputStream and an FSDataOutputStream are opened for the given file paths. From that point on, copying from one stream to another is no different from ordinary stream handling in Java.
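For a simple local-to-HDFS copy, the FileSystem class also offers a convenience method, copyFromLocalFile(), which handles the stream loop internally. A minimal sketch (the class name CopyFileUtil is my own; running it requires the Hadoop libraries and configuration on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFileUtil {
  public static void main(String[] args) throws Exception {
    Configuration config = new Configuration();
    FileSystem hdfs = FileSystem.get(config);
    // copyFromLocalFile() opens, reads, writes, and closes the streams for us
    hdfs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
  }
}
```

The hand-written loop in the sample above is still worth knowing, since it shows what such convenience methods do under the hood.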

In addition to the HDFS CLIs, another common use case for the HDFS APIs is the InputFormat, the first step of MapReduce processing, which reads a data file and splits it into key/value pairs. I’ll cover the whole data processing pipeline, stage by stage, in a separate post.
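To give a taste of that connection, here is a sketch of wiring the standard TextInputFormat into a MapReduce job; it reads its input splits through the HDFS APIs and emits (byte offset, line) pairs. The job name and the /input path are placeholders of my own, and the fragment assumes a Hadoop 2.x-style classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sample-job");
    // TextInputFormat uses the HDFS APIs to open each split
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
  }
}
```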

To understand the APIs more fully, you can browse the HDFS API reference. Don’t spend too much time on it, though, unless you expect to use the APIs soon.

As a last note, I feel the way the configuration and file system are associated is not straightforward, especially the configuration creation. It could have been more explicit with a factory pattern similar to Runtime.getRuntime() or Desktop.getDesktop() in the Java standard library. In that style, it would be Configuration.getConfiguration().
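To illustrate the factory pattern I have in mind, here is how the standard library exposes Runtime; the Configuration.getConfiguration() call in the comment is hypothetical, not a real Hadoop API:

```java
public class FactoryPatternDemo {
  public static void main(String[] args) {
    // Java's standard library exposes singletons via static factory methods:
    Runtime rt = Runtime.getRuntime();
    System.out.println("Available processors: " + rt.availableProcessors());

    // By analogy, Hadoop could have offered (hypothetical, not a real API):
    // Configuration config = Configuration.getConfiguration();
  }
}
```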



Me: Steve Jin, VMware vExpert, who authored the VMware VI and vSphere SDK by Prentice Hall and created the de facto open source vSphere Java API while working at VMware engineering. Companies like Cisco, EMC, NetApp, HP, Dell, and VMware are among the users of the API and other tools I developed for their products, internal IT orchestration, and test automation.