JanusGraph - Distributed graph database

JanusGraph is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster. JanusGraph is a transactional database that can support thousands of concurrent users, complex traversals, and analytic graph queries.

tera - An Internet-Scale Database.

Copyright 2015, Baidu, Inc. Tera is a collection of many sparse, distributed, multidimensional tables. Each table is indexed by a row key, a column key, and a timestamp; each value in the table is an uninterpreted array of bytes.
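The data model described above can be illustrated with a minimal in-memory sketch. This is not Tera's API: the `SparseTable` class and its `put`/`get` methods are hypothetical names chosen for illustration, and the sketch assumes that a read without a timestamp returns the newest version of a cell.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory model of a sparse, multidimensional table.
# Illustration only: this is NOT Tera's client API.
CellKey = Tuple[bytes, bytes, int]  # (row key, column key, timestamp)


class SparseTable:
    def __init__(self) -> None:
        # Only cells that were actually written take up space (sparse).
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: bytes, column: bytes, timestamp: int, value: bytes) -> None:
        # Values are stored as uninterpreted byte arrays.
        self._cells[(row, column, timestamp)] = value

    def get(self, row: bytes, column: bytes,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        if timestamp is not None:
            return self._cells.get((row, column, timestamp))
        # Assumption: without a timestamp, return the newest version.
        versions = [ts for (r, c, ts) in self._cells if r == row and c == column]
        if not versions:
            return None
        return self._cells[(row, column, max(versions))]


table = SparseTable()
table.put(b"com.example/index", b"contents", 1, b"<html>v1</html>")
table.put(b"com.example/index", b"contents", 2, b"<html>v2</html>")
latest = table.get(b"com.example/index", b"contents")
```

A real distributed table would also keep rows sorted and split them across servers; the sketch covers only the three-part index.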

OpenTSDB - A scalable, distributed Time Series Database.

OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.

Apache Trafodion - Webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop.

Apache Trafodion is a webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop. Trafodion builds on the scalability, elasticity, and flexibility of Hadoop. Trafodion extends Hadoop to provide guaranteed transactional integrity, enabling new kinds of big data applications to run on Hadoop.

Elasticsearch-Exporter - A small script to export data from one Elasticsearch cluster into another.

A command line script to import/export data from ElasticSearch to various other storage systems. This is a brand new implementation with lots of bugs and way too little time to test everything for one lonely developer, so please consider this beta at best and provide feedback, bug reports and maybe even patches.

Kundera - JPA 2.0 ORM library for the Cassandra/HBase/MongoDB database.

A JPA 2.0 compliant Object-Datastore Mapping Library for NoSQL Datastores. The idea behind Kundera is to make working with NoSQL Databases drop-dead simple and fun. Currently it supports Cassandra, MongoDB, HBase and Relational databases.

hbase-rdd - Spark RDD to read and write from HBase

This project connects Apache Spark to HBase. Currently it is compiled with Scala 2.10 and 2.11, using the versions of Spark and HBase available on CDH5.5. Version 0.6.0 of this project works on CDH5.3, version 0.4.0 works on CDH5.1 and version 0.2.2-SNAPSHOT works on CDH5.0. Other combinations of versions may be made available in the future. This guide assumes you are using SBT. Similar tools such as Maven or Leiningen should work as well, with minor differences.

hbase-docker - HBase running in Docker

This configuration builds a Docker container that runs HBase (with embedded ZooKeeper) on the files inside the container. The approach here requires editing the local server's /etc/hosts file to add an entry for the container hostname. This is because HBase uses hostnames to pass connection data back out of the container (from its internal ZooKeeper).

cbass - adding "simple" to HBase

In this example we are just muting "packing" and "unpacking", relying on custom serialization being done before calling cbass, so the data is already a byte array. Deserialization is done after the value is returned from cbass, since cbass just returns a byte array in this case (i.e. the identity function is used for both). Notice "pluto": it has no columns, which is also fine.
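The idea of muting packing and unpacking can be sketched outside of cbass. The following Python sketch is an assumption-laden illustration, not cbass's Clojure API: `ByteStore`, `store`, and `find` are hypothetical names, and a plain dictionary stands in for an HBase table. The caller serializes values itself, and the store passes bytes through using the identity function in both directions.

```python
import json
from typing import Callable, Dict, Optional


def identity(data: bytes) -> bytes:
    # "Muted" pack/unpack: bytes go in and come out unchanged.
    return data


class ByteStore:
    """Hypothetical stand-in for a table wrapper; not the cbass API."""

    def __init__(self,
                 pack: Callable[[bytes], bytes] = identity,
                 unpack: Callable[[bytes], bytes] = identity) -> None:
        self._pack = pack
        self._unpack = unpack
        self._rows: Dict[str, Dict[str, bytes]] = {}

    def store(self, row_key: str,
              columns: Optional[Dict[str, bytes]] = None) -> None:
        # A row with no columns is allowed.
        cols = columns or {}
        self._rows[row_key] = {name: self._pack(v) for name, v in cols.items()}

    def find(self, row_key: str) -> Dict[str, bytes]:
        row = self._rows.get(row_key, {})
        return {name: self._unpack(v) for name, v in row.items()}


galaxy = ByteStore()  # identity for both pack and unpack

# Custom serialization happens before the call...
galaxy.store("earth", {"age": json.dumps("4.543 billion years").encode("utf-8")})
galaxy.store("pluto")  # no columns

# ...and deserialization happens after the value comes back.
age = json.loads(galaxy.find("earth")["age"].decode("utf-8"))
```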

gimel - PayPal's Big Data Processing Framework

Gimel provides a unified Data API to access data from any storage system, such as HDFS, GS, Alluxio, HBase, Aerospike, BigQuery, Druid, Elastic, Teradata, Oracle, MySQL, etc.

hbase-mr-pof - A proof of concept prototype of new HBase + Hadoop Map Reduce integration

A proof of concept prototype of new HBase + Hadoop Map Reduce integration

ansible-cloudera-hadoop - ansible playbook to deploy cloudera hadoop components to the cluster

The playbook is composed according to the official Cloudera guides, with production deployment as its primary purpose. High availability for HDFS and YARN is implemented when a sufficient number of resources (hosts) is configured. On the other hand, all of the components can also be deployed on a single host. It’s only required to place the hostname(s) in the appropriate group in the hosts file, and the required services will be set up.

sparksql-for-hbase - Learn how to use Spark SQL and HSpark connector package to create / query data tables that reside in HBase region servers

Apache HBase is an open source, NoSQL distributed database which runs on top of the Hadoop Distributed File System (HDFS), and is well-suited for fast read/write operations on large datasets with high throughput and low input/output latency. But, unlike relational and traditional databases, HBase lacks support for SQL scripting, data types, etc., and requires the Java API to achieve the equivalent functionality. This journey is intended to give application developers familiar with SQL the ability to access HBase data tables using the same SQL commands. You will quickly learn how to create and query the data tables by using Apache Spark SQL and the HSpark connector package. This allows you to take advantage of the significant performance gains from using HBase without having to learn the Java APIs traditionally required to access the HBase data tables.