Calliope - Bridge between Cassandra and Spark framework


Calliope provides a bridge between Cassandra and the Spark framework, making it easy to build real-time big data applications. It is a library that provides an interface to consume data from Cassandra into Spark and to store RDDs from Spark back to Cassandra.
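
A minimal read sketch, assuming Calliope's CQL3 builder API; the package, class, and method names below are reconstructed from the project's documentation and should be treated as assumptions, as are the keyspace and table names:

```scala
import org.apache.spark.SparkContext
import com.tuplejump.calliope.CasBuilder     // assumed package
import com.tuplejump.calliope.Implicits._    // assumed implicit enrichments on SparkContext

val sc = new SparkContext("local[2]", "calliope-demo")

// Describe the Cassandra source; "demo_ks" and "events" are placeholders.
val cas = CasBuilder.cql3.withColumnFamily("demo_ks", "events")

// Pull rows into an RDD; the element type depends on the implicit
// unmarshaller in scope.
val rows = sc.cql3Cassandra[Map[String, Any]](cas)
println(rows.count())
```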



Related Projects

spark-cassandra-connector - DataStax Spark Cassandra Connector

Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®. This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries from your Spark applications.
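
A short sketch of the three capabilities just listed; the contact point, keyspace demo_ks, and table are placeholders, and the keyspace is assumed to already exist:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

val conf = new SparkConf()
  .setAppName("connector-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1") // your Cassandra node

val sc = new SparkContext(conf)

// 1. Execute arbitrary CQL from the driver program.
CassandraConnector(conf).withSessionDo { session =>
  session.execute(
    "CREATE TABLE IF NOT EXISTS demo_ks.words (word text PRIMARY KEY, count int)")
}

// 2. Write a Spark RDD to a Cassandra table.
sc.parallelize(Seq(("cat", 30), ("fox", 40)))
  .saveToCassandra("demo_ks", "words", SomeColumns("word", "count"))

// 3. Expose the Cassandra table as a Spark RDD.
sc.cassandraTable("demo_ks", "words").foreach(println)
```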

GeoMesa - Suite of tools for working with big geo-spatial data in a distributed fashion

GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.
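
GeoMesa stores are accessed through the standard GeoTools DataStore API, so a query sketch looks like the following; the connection parameter keys vary by backend and are assumptions here, as are the feature type and attribute names:

```scala
import org.geotools.data.{DataStoreFinder, Query, Transaction}
import org.geotools.filter.text.ecql.ECQL
import scala.collection.JavaConverters._

// Backend-specific connection parameters; the keys below are illustrative only.
val params = Map(
  "cassandra.contact.point" -> "127.0.0.1:9042",
  "cassandra.keyspace"      -> "geomesa"
).asJava

val ds = DataStoreFinder.getDataStore(params)

// A spatio-temporal predicate that GeoMesa can push down to its index.
val filter = ECQL.toFilter(
  "BBOX(geom, -80, 35, -75, 40) AND dtg DURING 2017-01-01T00:00:00Z/2017-01-02T00:00:00Z")

val reader = ds.getFeatureReader(new Query("my_feature_type", filter), Transaction.AUTO_COMMIT)
while (reader.hasNext) println(reader.next())
reader.close()
```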

Spark - Fast Cluster Computing

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
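
The in-memory execution model is easy to see in a small sketch: cache a dataset once, then run several queries against it without re-reading from disk (the file path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("spark-demo").setMaster("local[2]"))

// Held in memory after the first action touches it.
val logs = sc.textFile("/path/to/logs.txt").cache()

// Both queries reuse the cached dataset instead of re-scanning disk.
val errors   = logs.filter(_.contains("ERROR")).count()
val warnings = logs.filter(_.contains("WARN")).count()
println(s"errors=$errors warnings=$warnings")
```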

akka-analytics - Large-scale event processing with Akka Persistence and Apache Spark

Events for a given persistenceId are partitioned across nodes in the Cassandra cluster, where the partition is represented by the partition field in the key. The eventTable() method returns an RDD in which events with the same persistenceId-partition combination (i.e., the same cluster partition) are ordered by increasing sequenceNr, but the ordering across cluster partitions is undefined. If needed, the RDD can be sorted with sortByKey() by persistenceId, partition, and sequenceNr, in that order of significance. Note that the default size of a cluster partition in the Cassandra journal is 5,000,000 events (see akka-persistence-cassandra). The stream of events (written by all persistent actors) is partially ordered: events with the same persistenceId are ordered by sequenceNr, whereas the ordering of events with different persistenceIds is undefined.
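
A sketch of the ordering guarantees described above, following the project's README and assuming an existing SparkContext sc configured for the Cassandra journal keyspace:

```scala
import org.apache.spark.rdd.RDD
import akka.analytics.cassandra._ // enriches SparkContext with eventTable()

// Each key carries (persistenceId, partition, sequenceNr).
val events: RDD[(JournalKey, Any)] = sc.eventTable()

// Ordering across cluster partitions is undefined, so restore a total order
// by (persistenceId, partition, sequenceNr) before sequential processing.
val ordered = events.sortByKey()

ordered.take(10).foreach { case (key, event) =>
  println(s"${key.persistenceId}/${key.sequenceNr}: $event")
}
```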

TensorFlowOnSpark - TensorFlowOnSpark brings TensorFlow programs onto Apache Spark clusters

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the deep learning framework TensorFlow and the big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on its Hadoop clusters in Yahoo's private cloud.

Shark - Hive on Spark

Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users, running Hive queries up to 100x faster in memory, or 10x faster on disk. It is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.
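
A sketch of how Shark exposed HiveQL results directly as RDDs; the initWithSharkContext and sql2rdd names are reconstructed from the project's documentation and should be treated as assumptions, and the table is a placeholder:

```scala
import shark.{SharkContext, SharkEnv}

// SharkContext accepts HiveQL; results come back as an RDD usable by any Spark job.
val sc: SharkContext = SharkEnv.initWithSharkContext("shark-demo")

val rdd = sc.sql2rdd("SELECT page, count(*) AS hits FROM wikistats GROUP BY page")
println(rdd.count())
```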

MMLSpark - Microsoft Machine Learning for Apache Spark

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

Mobius - C# and F# language binding and extensions to Apache Spark

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in .NET languages such as C# and F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.


CaffeOnSpark - Distributed deep learning on Hadoop and Spark clusters

CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can perform distributed learning using their existing LMDB data files with minor adjustments to their network configuration.

Cassandra - Scalable Distributed Database

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Cassandra is suitable for applications that can't afford to lose data. Data is automatically replicated to multiple nodes for fault-tolerance.
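
Replication is configured per keyspace. A minimal sketch using the DataStax Java driver (3.x API) to request three copies of every row; the contact point and keyspace name are placeholders:

```scala
import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// replication_factor = 3: every row is stored on three nodes, so the
// keyspace survives the loss of any single node.
session.execute(
  """CREATE KEYSPACE IF NOT EXISTS demo_ks
    |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}""".stripMargin)

session.close()
cluster.close()
```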

genie - Distributed Big Data Orchestration Service

Genie is a federated job orchestration engine developed by Netflix. Genie provides REST-ful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them. See the official website for documentation about Genie and specific documentation for the various releases.

lopq - Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark

This is Python training and testing code for Locally Optimized Product Quantization (LOPQ) models, as well as Spark scripts to scale training to hundreds of millions of vectors. The resulting model can be used in Python with the code provided in the repository, or deployed via a Protobuf format to, e.g., search backends for high-performance approximate nearest neighbor search. Locally Optimized Product Quantization (LOPQ) [1] is a hierarchical quantization algorithm that produces codes of configurable length for data points. These codes are efficient representations of the original vector and can be used in a variety of ways depending on the application: as hashes that preserve locality, as a compressed vector from which an approximate vector in the data space can be reconstructed, or as a representation from which to compute an approximation of the Euclidean distance between points.

cpp-driver - DataStax C/C++ Driver for Apache Cassandra

A modern, feature-rich, and highly tunable C/C++ client library for Apache Cassandra (1.2+) and DataStax Enterprise (3.1+), using exclusively Cassandra's native protocol and Cassandra Query Language v3. Note: DataStax products do not support big-endian systems.

sparklyr - R interface for Apache Spark

sparklyr provides an R interface to Apache Spark. The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster. If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark.

Logsandra - log management using Cassandra

Logsandra is a log management application written in Python that uses Cassandra as its back-end. It was written as a demo for Cassandra, but it is worth a look. It also provides support for creating your own parsers.

aegisthus - A Bulk Data Pipeline out of Cassandra

Aegisthus has been transitioned to maintenance mode. It is still used for ETL at Netflix for Cassandra 2.x clusters, but it will not be evolving further. Aegisthus implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.

Fluo - Make incremental updates to large data sets stored in Apache Accumulo

Apache Fluo (incubating) is an open source implementation of Percolator (which populates Google's search index) for Apache Accumulo. Fluo makes it possible to update the results of a large-scale computation, index, or analytic as new data is discovered. When combining new data with existing data, Fluo offers reduced latency compared to batch processing frameworks (e.g., Spark, MapReduce).
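
A transaction sketch against the Fluo 1.x client API; the row, column, and configuration are placeholders, and the string-convenience overloads (gets/set) are assumed:

```scala
import org.apache.fluo.api.client.{FluoClient, FluoFactory}
import org.apache.fluo.api.config.FluoConfiguration
import org.apache.fluo.api.data.Column

val config = new FluoConfiguration() // point this at your Fluo application
val client: FluoClient = FluoFactory.newClient(config)

val tx = client.newTransaction()
try {
  // Read-modify-write under snapshot isolation; conflicting commits abort.
  val current = Option(tx.gets("row1", new Column("stats", "count"))).getOrElse("0")
  tx.set("row1", new Column("stats", "count"), (current.toInt + 1).toString)
  tx.commit()
} finally {
  tx.close()
  client.close()
}
```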