snakebite - A pure python HDFS client

  •        65

Snakebite is a python library that provides a pure python HDFS client and a wrapper around Hadoops minicluster. The client uses protobuf for communicating with the NameNode and comes in the form of a library and a command line interface. Currently, the snakebite client supports most actions that involve the Namenode and reading data from DataNodes.Note: all methods that read data from a data node are able to check the CRC during transfer, but this is disabled by default because of performance reasons. This is the opposite behaviour from the stock Hadoop client.

https://github.com/spotify/snakebite

Tags
Implementation
License
Platform

   




Related Projects

hdfs - A native go client for HDFS

  •    Go

This is a native golang client for hdfs. It connects directly to the namenode using the protocol buffers API. It tries to be idiomatic by aping the stdlib os package, where possible, and implements the interfaces from it, including os.FileInfo and os.PathError.

minos - Minos is beyond a hadoop deployment system.

  •    Python

Minos is a distributed deployment and monitoring system. It was initially developed and used at Xiaomi to deploy and manage the Hadoop, HBase and ZooKeeper clusters used in the company. Minos can be easily extended to support other systems, among which HDFS, YARN and Impala have been supported in the current release. This is the command line client tool used to deploy and manage processes of various systems. You can use this client to perform various deployment tasks, e.g. installing, (re)starting, stopping a service. Currently, this client supports ZooKeeper, HDFS, HBase, YARN and Impala. It can be extended to support other systems. You can refer to the following Using Client to learn how to use it.

hadoop-hdfs - Mirror of Apache Hadoop HDFS

  •    Java

Mirror of Apache Hadoop HDFS

smart_open - Utils for streaming large files (S3, HDFS, gzip, bz2...)

  •    Python

There are a few optional keyword arguments that are useful only for S3 access. These are both passed to boto.s3_connect() as keyword arguments. The S3 reader supports gzipped content, as long as the key is obviously a gzipped file (e.g. ends with ".gz").

camus - LinkedIn's previous generation Kafka to HDFS pipeline.

  •    Java

Camus is LinkedIn's Kafka->HDFS pipeline. It is a mapreduce job that does distributed data loads out of Kafka.


golang-distributed-filesystem - HDFS-alike in Go. Written to learn the language and get a job.

  •    Go

Writing a HDFS clone in Go to learn more about Go and the nitty-gritty of distributed systems.

Hue - The open source Apache Hadoop UI

  •    Java

Hue is a Web application for interacting with Apache Hadoop. It supports a FileBrowser for accessing HDFS, JobBrowser for accessing MapReduce jobs (MR1/MR2-YARN), Job Designer for creating MapReduce/Streaming/Java jobs, HBase Browser for exploring and modifying HBase tables and data, Oozie App for submitting and scheduling workflows and bundles, A Pig/HBase/Sqoop2 shell, Beeswax application for executing Hive queries, Search app for querying Solr and Solr Cloud.

MMLSpark - Microsoft Machine Learning for Apache Spark

  •    Scala

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly-scalable predictive and analytical models for large image and text datasets.MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

spark-ec2 - Scripts used to setup a Spark cluster on EC2

  •    Python

Please note: spark-ec2 is no longer under active development and the project has been archived. All the existing code, PRs and issues are still accessible but are now read-only. If you're looking for a similar tool that is under active development, we recommend you take a look at Flintrock. spark-ec2 allows you to launch, manage and shut down Apache Spark [1] clusters on Amazon EC2. It automatically sets up Apache Spark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you've already signed up for an EC2 account on the Amazon Web Services site.

spotify-gnome - Gnome media key integration for the Spotify Linux client

  •    Python

Gnome media key integration for the Spotify Linux client

raspotify - Spotify Connect client for the Raspberry Pi that Just Works™

  •    Shell

Spotify Connect client for the Raspberry Pi that Just Works™. Raspotify is a Spotify Connect client for Raspbian on the Raspberry Pi that Just Works™. Raspotify is a Debian package and associated repository which thinly wraps the awesome librespot library by Paul Lietar and others. It works out of the box on all three revisions of the Pi, immediately after installation.

librespot - Open Source Spotify client library

  •    Rust

librespot is an open source client library for Spotify. It enables applications to use Spotify's service, without using the official but closed-source libspotify. Additionally, it will provide extra features which are not available in the official library. Note: librespot only works with Spotify Premium.

librespot - Open Source Spotify client library

  •    Rust

librespot is an open source client library for Spotify. It enables applications to use Spotify's service, without using the official but closed-source libspotify. Additionally, it will provide extra features which are not available in the official library. As the origin by plietar is no longer actively maintained, this organisation and repository have been set up so that the project may be maintained and upgraded in the future.

spotifyd - A spotify daemon

  •    Rust

An open source Spotify client running as a UNIX daemon. Spotifyd streams music just like the official client, but is more lightweight and supports more platforms. Spotifyd also supports the Spotify Connect protocol which makes it show up as a device that can be controlled from the official clients. Spotifyd requires a Spotify Premium account.

spotify-web-api-js - A client-side JS wrapper for the Spotify Web API

  •    TypeScript

This is a lightweight wrapper for the Spotify Web API (2.4kB gzipped + compressed). It includes helper functions for all Spotify's endpoints, such as fetching metadata (search and look-up of albums, artists, tracks, playlists, new releases) and user's information (follow users, artists and playlists, and saved tracks management). It doesn't have any dependencies and supports callbacks and promises. It is intended to be run on a browser, but if you want to use Node.JS to make the requests, please check spotify-web-api-node.

Spark - Fast Cluster Computing

  •    Scala

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Ambari - Monitor Hadoop Cluster

  •    Java

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. The set of Hadoop components that are currently supported by Ambari includes HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop.

Kudu - Hadoop storage layer to enable fast analytics on fast data

  •    C++

Kudu is a storage system for tables of structured data. Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the flexibility to address a wider variety of use cases without exotic workarounds.