Use HDFS natively from Python.
The Hadoop Distributed File System (HDFS) is a widely deployed, distributed, data-local file system written in Java. This file system backs most clusters running Hadoop and Spark.
Pivotal produced libhdfs3, an alternative native C/C++ HDFS client that interacts with HDFS without the JVM, exposing first-class support to non-JVM languages like Python.
hdfs3 is a lightweight Python wrapper around the C/C++ libhdfs3 library. It provides both direct access to libhdfs3 from Python and a typical Pythonic interface.
>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)
>>> hdfs.ls('/user/data')
>>> hdfs.put('local-file.txt', '/user/data/remote-file.txt')
>>> hdfs.cp('/user/data/file.txt', '/user2/data')
hdfs3 files implement the standard Python file interface. This enables interaction with the broader ecosystem of PyData projects.
>>> with hdfs.open('/user/data/file.txt') as f:
...     data = f.read(1000000)

>>> with hdfs.open('/user/data/file.csv.gz') as f:
...     df = pandas.read_csv(f, compression='gzip', nrows=1000)
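This file-interface compatibility is what lets libraries like pandas consume hdfs3 file objects directly. As a sketch of the same mechanism that runs without an HDFS cluster, any Python file-like object with a `.read()` method works the same way (here `io.BytesIO` stands in for the object `hdfs.open()` would return):

```python
import gzip
import io

import pandas as pd

# Build a gzipped CSV in memory to stand in for a file on HDFS.
raw = b"name,value\na,1\nb,2\nc,3\n"
buf = io.BytesIO(gzip.compress(raw))

# pandas only needs a file-like object, not a path; this is the same
# duck typing that makes hdfs3 file objects work with read_csv.
df = pd.read_csv(buf, compression='gzip', nrows=2)
print(df)
```

Because the compatibility is duck typing rather than a special integration, the same pattern extends to any consumer that accepts open file objects.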
We chose an alternative C/C++/Python HDFS client over the default JVM client for the following reasons:
- Convenience: Interactions between Java libraries and native (C/C++/Python) libraries can be cumbersome. Using a native library from Python smooths over the experience in development, maintenance, and debugging.
- Performance: Native libraries like libhdfs3 do not suffer the long JVM startup times, improving interactive use.