Calliope - Bridge between Cassandra and Spark framework

Calliope provides a bridge between Cassandra and the Spark framework, allowing you to create those magical, real-time big data apps with ease. It is a library that provides an interface for consuming data from Cassandra into Spark and for storing RDDs from Spark in Cassandra.



Related Projects

spark-cassandra-connector - DataStax Spark Cassandra Connector

Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®. This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.
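As a rough illustration of the read and write paths described above, here is a minimal Python sketch. It assumes PySpark, where the connector is normally reached through the DataFrame data source named org.apache.spark.sql.cassandra rather than through the RDD API, and it assumes the connector package is already on the Spark classpath. The helper names read_table and append_to_table, and any keyspace or table names passed to them, are hypothetical.

```python
# Data source name under which the Spark Cassandra Connector is registered.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_table(spark, keyspace, table):
    """Load a Cassandra table as a DataFrame.

    `spark` is expected to be a SparkSession created elsewhere.
    """
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


def append_to_table(df, keyspace, table):
    """Append the rows of a DataFrame to an existing Cassandra table."""
    (
        df.write.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )
```

With a live SparkSession, read_table(spark, "my_keyspace", "my_table") would return a DataFrame that can be transformed and then handed to append_to_table. Both names in that call are placeholders, and the target table must already exist in Cassandra.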

SparkBuildExamples - Example projects for using Spark and Cassandra With DSE Analytics

These are template projects that illustrate how to build Spark applications written in Java or Scala with Maven, SBT or Gradle, which can be run on either DataStax Enterprise (DSE) or Apache Spark. The example project implements a simple write-to/read-from-Cassandra application for each language and build tool. Compiling Spark applications depends on Apache Spark and optionally on Spark Cassandra Connector jars. The dse and oss projects show two different ways of supplying these dependencies. Both projects are built and executed with similar commands.

spark-cassandra-stress - A tool for testing the DataStax Spark Connector against Apache Cassandra or DSE

DSE libraries are located by looking for the installation of DSE on your machine. Change the environment variables DSE_HOME and DSE_RESOURCES if your installation differs from the default. When getting libraries from Maven, we need to specify the Connector and Spark library versions to compile against. Change the environment variables CONNECTOR_VERSION and SPARK_VERSION to the artifacts you would like to use.

Cassandra-Spark-Demo - Demo for the Spark Cassandra connector

Demo for the Spark Cassandra connector

GeoMesa - Suite of tools for working with big geo-spatial data in a distributed fashion

GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.


A subproject of Predictiveworks that provides common access to Cassandra, Elasticsearch, HBase, MongoDB, Parquet, JDBC database and other data sources from Apache Spark.

Spark - Fast Cluster Computing

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Flare - Decentralized Processing using Spark and Ethereum

Flare is the (uncontested) first implementation of decentralized computing with Ethereum. The goal is to use Apache Spark and Cassandra, two technologies designed for cluster computation, and integrate a connection with Ethereum to provide trusted, verifiably correct, computation in untrusted systems. This will allow anyone with internet access to run code in a distributed decentralized processing network.

uberscriptquery - UberScriptQuery, a SQL-like DSL to make writing Spark jobs super easy

UberScriptQuery is a script query wrapper to run Spark SQL jobs. Why did we build this? Apache Spark is a great tool to do data processing, yet people usually end up writing many similar Spark jobs. There is substantial development cost to write and maintain all these jobs. Additionally, Spark is still mostly for developers, and other people such as data analysts or data scientists may still feel that Spark has a steep learning curve.

spark-parquet-thrift-example - Example Spark project using Parquet as a columnar store with Thrift objects

Apache Spark is a research project for distributed computing which interacts with HDFS and heavily utilizes in-memory caching. Modern datasets contain hundreds or thousands of columns and are too large to cache all the columns in Spark's memory, so Spark has to resort to paging to disk. The disk paging penalty can be lessened or removed if the Spark application only interacts with a subset of the columns in the overall database by using a columnar store database such as Parquet, which will only load the specified columns of data into a Spark RDD. Matt Massie's example uses Parquet with Avro for data serialization and filters loading based on an equality predicate, but does not show how to load only a subset of columns. This project shows a complete Scala/sbt project using Thrift for data serialization and shows how to load columnar subsets.
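The idea of loading only a subset of columns can be sketched in plain Python. This is a toy model of a columnar layout, not Parquet's actual file format or the project's Scala/Thrift code; the sample records and the helper names to_columnar and project are made up for illustration.

```python
# Row-oriented layout: every record carries all of its columns together.
rows = [
    {"id": 1, "name": "alpha", "score": 0.5, "notes": "long free text"},
    {"id": 2, "name": "beta", "score": 0.9, "notes": "more free text"},
]


def to_columnar(records):
    """Pivot row dicts into one list per column.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def project(columns, wanted):
    """Return only the requested columns; the others are never read."""
    return {name: columns[name] for name in wanted}


columnar = to_columnar(rows)
subset = project(columnar, ["id", "score"])
print(subset)  # {'id': [1, 2], 'score': [0.5, 0.9]}
```

Because each column is stored separately, the wide "notes" column is never touched when only "id" and "score" are requested, which is the property that lets a columnar store avoid loading unneeded data into memory.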

snappy-spark - Apache Spark with SnappyData extensions

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

Shark - Hive on Spark

Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users. It runs Hive queries up to 100x faster in memory, or 10x on disk. It is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

MMLSpark - Microsoft Machine Learning for Apache Spark

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly-scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.

PySpark-Predictive-Maintenance - Predictive Maintenance using Pyspark

Predictive maintenance is one of the most common machine learning use cases. With the latest advancements in information technology, the volume of stored data in this domain is growing faster than ever before, which makes it necessary to leverage big data analytics capabilities to efficiently transform large amounts of data into business intelligence. Microsoft has published a series of learning materials including blogs, solution templates, modeling guides and sample tutorials in the domain of predictive maintenance. In this tutorial, we extend those materials by providing a detailed step-by-step process that uses PySpark, the Spark Python API, to demonstrate how to approach predictive maintenance for big data scenarios. The tutorial covers typical data science steps such as data ingestion, cleansing, feature engineering and model development. The input data is simulated to reflect features that are generic for most predictive maintenance scenarios. To enable the tutorial to be completed very quickly, the data was simulated to be around 1.3 GB, but the same PySpark framework can be easily applied to a much larger data set. The data is hosted on a publicly accessible Azure Blob Storage container and is available for download. In this tutorial, we import the data directly from the blob storage.

Mobius - C# and F# language binding and extensions to Apache Spark

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in the languages supported by the .NET framework, such as C# or F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

aztk - On-demand, Dockerized, Spark Jobs on Azure (powered by Azure Batch)

Azure Distributed Data Engineering Toolkit is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale. This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

spark-cluster-deployment - Automates Spark standalone cluster tasks with Puppet and Fabric.

Apache Spark is a research project for distributed computing which interacts with HDFS and heavily utilizes in-memory caching. Spark 1.0.0 can be deployed to traditional cloud and job management services such as EC2, Mesos, or Yarn. Further, Spark's standalone cluster mode enables Spark to run on other servers without installing other job management services. However, configuring and submitting applications to a Spark 1.0.0 standalone cluster currently requires files to be synchronized across the entire cluster, including the Spark installation directory. This project utilizes Fabric and Puppet to further automate the Spark standalone cluster. The Puppet scripts are MIT-licensed from stefanvanwouw/puppet-spark and wikimedia/puppet-cdh4.

cassandra-modeling-kata - Cassandra Modeling Kata

At Allegro we operate in a highly distributed (microservices), data-intensive cloud environment, and Apache Cassandra is a great asset in our polyglot persistence toolbox. Apache Cassandra is becoming our No. 1 choice for cloud-based solutions due to its high availability, linear scaling and flexible data modeling capabilities. In this kata, I will introduce basic Apache Cassandra modeling techniques. Next, we will use these techniques in practice to develop a simple e-commerce application.

Cassandra - Scalable Distributed Database

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Cassandra is suitable for applications that can't afford to lose data. Data is automatically replicated to multiple nodes for fault-tolerance.

genie - Distributed Big Data Orchestration Service

Genie is a federated job orchestration engine developed by Netflix. Genie provides RESTful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them. See the official website to find documentation about Genie and specific documentation for various releases.