hdfs3

Use HDFS natively from Python.

The Hadoop Distributed File System (HDFS) is a widely deployed, distributed, data-local file system written in Java. It backs most clusters running Hadoop and Spark.

Pivotal produced libhdfs3, an alternative native C/C++ HDFS client that interacts with HDFS without the JVM, giving non-JVM languages like Python first-class access to HDFS.

This library, hdfs3, is a lightweight Python wrapper around the C/C++ libhdfs3 library. It provides both direct access to libhdfs3 from Python and a typical Pythonic interface.

>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)
>>> hdfs.ls('/user/data')
>>> hdfs.put('local-file.txt', '/user/data/remote-file.txt')
>>> hdfs.cp('/user/data/file.txt', '/user2/data')

hdfs3 files comply with the Python file interface, so they can be passed to the broader ecosystem of PyData projects.

>>> with hdfs.open('/user/data/file.txt') as f:
...     data = f.read(1000000)

>>> import pandas
>>> with hdfs.open('/user/data/file.csv.gz') as f:
...     df = pandas.read_csv(f, compression='gzip', nrows=1000)
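
Because these objects follow the ordinary file interface, code written against that interface does not need to know anything about HDFS. The following is a minimal sketch of that idea, not part of hdfs3: `read_gzip_csv_rows` is a hypothetical helper, and an in-memory `io.BytesIO` buffer stands in for an HDFS file so the example runs without a cluster.

```python
import csv
import gzip
import io
from itertools import islice


def read_gzip_csv_rows(fileobj, nrows):
    """Return up to `nrows` parsed rows from a gzip-compressed CSV.

    Hypothetical helper (not part of hdfs3). It only assumes that
    `fileobj` is a binary file-like object opened for reading.
    """
    # gzip.open accepts an existing file object as well as a filename.
    with gzip.open(fileobj, mode='rt', encoding='utf-8', newline='') as text:
        return list(islice(csv.reader(text), nrows))


# Stand-in for an HDFS file: a small gzip-compressed CSV held in memory.
payload = gzip.compress(b"id,name\n1,alice\n2,bob\n3,carol\n")
with io.BytesIO(payload) as f:
    rows = read_gzip_csv_rows(f, nrows=2)

print(rows)
```

With a live cluster, the same helper should in principle accept the object returned by `hdfs.open('/user/data/file.csv.gz')` in place of the buffer, assuming that object supports the reads the `gzip` module performs.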

Motivation

We choose to use an alternative C/C++/Python HDFS client rather than the default JVM client for the following reasons:

  • Convenience: Interactions between Java libraries and native (C/C++/Python) libraries can be cumbersome. Using a native library from Python smooths development, maintenance, and debugging.
  • Performance: Native libraries like libhdfs3 do not suffer from long JVM startup times, which makes interactive use more responsive.