smart_open - Utils for streaming large files (S3, HDFS, gzip, bz2...)

  •    Python

There are a few optional keyword arguments that are useful only for S3 access. They are passed to boto.connect_s3() as keyword arguments. The S3 reader supports gzipped content, as long as the key name clearly marks a gzipped file (e.g. it ends with ".gz").
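
The extension-based gzip detection mentioned above can be illustrated with the standard library alone. This is a minimal sketch and not smart_open's own implementation: is_gzip_key and open_maybe_gzipped are hypothetical helpers, and an in-memory stream stands in for the bytes of an S3 object.

```python
import gzip
import io


def is_gzip_key(key_name):
    """Return True when the key name marks gzipped content (assumed rule: '.gz' suffix)."""
    return key_name.endswith(".gz")


def open_maybe_gzipped(key_name, raw_stream):
    """Wrap raw_stream in a gzip reader only if the key name ends with '.gz'."""
    if is_gzip_key(key_name):
        return gzip.GzipFile(fileobj=raw_stream, mode="rb")
    return raw_stream


# Stand-in for bytes fetched from S3; no network access is involved here.
payload = gzip.compress(b"first line\nsecond line\n")
reader = open_maybe_gzipped("logs/2015-01-01.txt.gz", io.BytesIO(payload))

# The gzip reader is iterated line by line, so the content is decompressed as a stream.
lines = [line.decode("utf-8").rstrip("\n") for line in reader]
```

A key without the suffix is returned untouched, so the caller reads plain and gzipped objects through the same interface.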

ibis - A pandas-like deferred expression system, with first-class SQL support

  •    Python

Ibis is a toolbox to bridge the gap between local Python environments, remote storage, execution systems like Hadoop components (HDFS, Impala, Hive, Spark) and SQL databases. Its goal is to simplify analytical workflows and make you more productive. Learn more about using the library at http://ibis-project.org.

Spark - Fast Cluster Computing

  •    Scala

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

snakebite - A pure python HDFS client

  •    Python

Snakebite is a Python library that provides a pure Python HDFS client and a wrapper around Hadoop's minicluster. The client uses protobuf for communicating with the NameNode and comes in the form of a library and a command line interface. Currently, the snakebite client supports most actions that involve the NameNode and reading data from DataNodes. Note: all methods that read data from a DataNode are able to check the CRC during transfer, but this is disabled by default for performance reasons. This is the opposite behaviour from the stock Hadoop client.

hdfs - A native go client for HDFS

  •    Go

This is a native Go client for HDFS. It connects directly to the NameNode using the protocol buffers API. It tries to be idiomatic by aping the stdlib os package, where possible, and implements the interfaces from it, including os.FileInfo and os.PathError.

TileDB - TileDB array data management

  •    C++

Array data management made fast and easy. TileDB allows you to manage the massive dense and sparse multi-dimensional array data that frequently arise in many important scientific applications.

camus - Mirror of Linkedin's Camus

  •    Java

Camus is LinkedIn's Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka.

kafka-connect-hdfs - Kafka Connect HDFS connector

  •    Java

kafka-connect-hdfs is a Kafka Connector for copying data between Kafka and Hadoop HDFS.

dynamometer - A tool for scale and performance testing of HDFS with a specific focus on the NameNode

  •    Java

Dynamometer is a tool to performance test Hadoop's HDFS NameNode. The intent is to provide a real-world environment by initializing the NameNode against a production file system image and replaying a production workload collected via e.g. the NameNode's audit logs. This allows for replaying a workload which is not only similar in characteristic to that experienced in production, but actually identical. Dynamometer will launch a YARN application which starts a single NameNode and a configurable number of DataNodes, simulating an entire HDFS cluster as a single application. There is an additional workload job run as a MapReduce job which accepts audit logs as input and uses the information contained within to submit matching requests to the NameNode, inducing load on the service.
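
The idea of turning audit logs into matching requests can be sketched in a few lines. This is an illustration only and not Dynamometer's workload job: it assumes an audit line that starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp and carries tab-separated key=value fields after an "FSNamesystem.audit: " marker, which may differ between Hadoop versions. parse_audit_line and replay_schedule are hypothetical names.

```python
from datetime import datetime

# Assumed marker that precedes the key=value fields of an audit entry.
AUDIT_MARKER = "FSNamesystem.audit: "


def parse_audit_line(line):
    """Parse one audit log line into (timestamp, fields), or None if it does not match."""
    if AUDIT_MARKER not in line:
        return None
    header, _, body = line.partition(AUDIT_MARKER)
    # Assumed layout: the first 23 characters hold "YYYY-MM-DD HH:MM:SS,mmm".
    timestamp = datetime.strptime(header[:23], "%Y-%m-%d %H:%M:%S,%f")
    fields = {}
    for pair in body.rstrip("\n").split("\t"):
        key, sep, value = pair.partition("=")
        if sep:
            fields[key] = value
    return timestamp, fields


def replay_schedule(lines):
    """Return (seconds since first entry, command, source path) tuples in time order."""
    parsed = [item for item in map(parse_audit_line, lines) if item is not None]
    parsed.sort(key=lambda item: item[0])
    if not parsed:
        return []
    start = parsed[0][0]
    return [
        ((ts - start).total_seconds(), fields["cmd"], fields["src"])
        for ts, fields in parsed
        if "cmd" in fields and "src" in fields
    ]


# Two made-up audit entries, 1.5 seconds apart.
sample = [
    "2015-01-01 00:00:00,000 INFO FSNamesystem.audit: allowed=true\tugi=hdfs\tip=/127.0.0.1\tcmd=open\tsrc=/tmp/a\tdst=null\tperm=null",
    "2015-01-01 00:00:01,500 INFO FSNamesystem.audit: allowed=true\tugi=hdfs\tip=/127.0.0.1\tcmd=listStatus\tsrc=/tmp\tdst=null\tperm=null",
]
schedule = replay_schedule(sample)
```

Keeping the relative offsets is what would let a replayer preserve the original timing of the workload rather than only its mix of operations.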

hadoop-for-geoevent - ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.

  •    Java

ArcGIS 10.4 GeoEvent Extension for Server sample Hadoop Output Connector for storing GeoEvents in HDFS. Find a bug or want to request a new feature? Please let us know by submitting an issue.

node-webhdfs - A WebHDFS module for Node.js.

  •    JavaScript

I am currently following and testing against the WebHDFS REST API documentation for the 1.2.1 release, by Apache. Make sure you enable WebHDFS in the hdfs site configuration file. I use Mocha and should.js for unit testing. They will be required if you want to run the unit tests. To execute the tests, simply run npm test, but install the requirements first. You will also likely need to adjust the constants in the test file first (or have a username "ryan" set up for hosts "endpoint1" and "endpoint2").
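
For orientation, a WebHDFS request is an HTTP call against a /webhdfs/v1 path with the operation given as a query parameter. The sketch below builds such a URL in Python rather than through this module's API. webhdfs_url is a hypothetical helper, port 50070 is an assumption (a common NameNode HTTP port in older Hadoop releases), and the host and user names are taken from the test setup described above.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user=None, **params):
    """Build a WebHDFS REST URL of the form http://host:port/webhdfs/v1<path>?op=..."""
    query = {"op": op}
    if user is not None:
        # WebHDFS accepts the caller's identity as the user.name query parameter.
        query["user.name"] = user
    query.update(params)
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(path), urlencode(query))


# List a directory as user "ryan" on host "endpoint1" (port is an assumption).
url = webhdfs_url("endpoint1", 50070, "/user/ryan", "LISTSTATUS", user="ryan")
```

The function only constructs the URL; sending it requires a reachable cluster with WebHDFS enabled.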

ros_hadoop - Hadoop splittable InputFormat for ROS

  •    Scala

RosbagInputFormat is an open source splittable Hadoop InputFormat for the ROS bag file format. For an example of a rosbag file larger than 2 GB, see doc/Rosbag larger than 2 GB.ipynb. This resolves the issue https://github.com/valtech/ros_hadoop/issues/6. The issue was due to ByteBuffer being limited by the JVM Integer size and has nothing to do with Spark or how the RosbagMapInputFormat works within Spark. It was only problematic when extracting the conf index with the jar.
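
The size limit behind that issue is easy to see in numbers: a JVM int tops out at 2^31 - 1, so a single buffer indexed by int cannot address a file of 2 GiB or more. The sketch below is not the project's fix; it only shows, with a hypothetical chunk_ranges helper, how a larger file can be covered by several ranges that each fit within that limit.

```python
# Largest value of a signed 32-bit integer, which bounds an int-indexed buffer.
JVM_INT_MAX = 2**31 - 1


def chunk_ranges(total_size, max_chunk=JVM_INT_MAX):
    """Split [0, total_size) into consecutive (offset, length) ranges of at most max_chunk bytes."""
    ranges = []
    offset = 0
    while offset < total_size:
        length = min(max_chunk, total_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


# A 3 GiB file needs two ranges, because one int-sized buffer cannot hold it.
three_gib = 3 * 1024**3
ranges = chunk_ranges(three_gib)
```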

hdfs - API and command line interface for HDFS

  •    Python

API and command line interface for HDFS. See the documentation to learn more.

ukwa-manage - Shepherding our web archives from crawl to access.

  •    Python

N.B. We currently run Python 2.7 on the Hadoop cluster, so streaming Hadoop tasks need to stick to that version. Other code should be written in Python 3 but be compatible with both where possible.

hadoop-hdfs-fsimage-exporter - Exports Hadoop HDFS content statistics to Prometheus

  •    Java

Hadoop HDFS FSImage Exporter allows exporting HDFS statistics for Prometheus from the Hadoop HDFS FSImage file snapshots.

hadoop-tools - Tools for working with Hadoop, written with performance in mind.

  •    Haskell

Tools for working with Hadoop written with performance in mind. By default, hh will behave the same as hdfs dfs or hadoop fs in terms of which user name to use for HDFS, or which namenodes to use.

pyhdfs - Python HDFS client

  •    Python

Because the world needs yet another way to talk to HDFS from Python. This library provides a Python client for WebHDFS. NameNode HA is supported by passing in both NameNodes. Responses are returned as nice Python classes, and any failed operation will raise some subclass of HdfsException matching the Java exception.
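
The combination of NameNode HA and typed exceptions can be pictured as a simple failover loop. The sketch below is a self-contained illustration and not the library's code: the exception classes are local stand-ins (only the name HdfsException comes from the description above), call_with_failover and fake_request are hypothetical, and the host names are made up.

```python
class HdfsException(Exception):
    """Local stand-in for the client's base exception."""


class HdfsStandbyException(HdfsException):
    """Stand-in for the error a standby NameNode would report."""


def call_with_failover(hosts, request):
    """Run request(host) against each NameNode in order, skipping standby nodes."""
    last_error = None
    for host in hosts:
        try:
            return request(host)
        except HdfsStandbyException as err:
            # This node cannot serve the request; remember why and try the next one.
            last_error = err
    raise HdfsException("no active NameNode among: " + ", ".join(hosts)) from last_error


def fake_request(host):
    """Simulated transport: the first NameNode is in standby, the second is active."""
    if host == "nn1.example.com:50070":
        raise HdfsStandbyException(host + " is in standby state")
    return {"host": host, "status": "ok"}


result = call_with_failover(
    ["nn1.example.com:50070", "nn2.example.com:50070"], fake_request
)
```

Because every failure derives from one base class, a caller can catch HdfsException broadly or a specific subclass when it wants to react to one kind of error.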

euphoria - Euphoria is an open source Java API for creating unified big-data processing flows

  •    Java

A Java API for creating unified big-data processing flows. It provides an engine-independent programming model that can express both batch and stream transformations.

