Displaying 1 to 20 from 273 results

spark-cassandra-connector - DataStax Spark Cassandra Connector

  •    Scala

Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®.This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.

TensorFlowOnSpark - TensorFlowOnSpark brings TensorFlow programs onto Apache Spark clusters

  •    Python

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from deep learning framework TensorFlow and big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.

primus - :zap: Primus, the creator god of the transformers & an abstraction layer for real-time to prevent module lock-in

  •    Javascript

Primus, the creator god of transformers but now also known as universal wrapper for real-time frameworks. There are a lot of real-time frameworks available for Node.js and they all have different opinions on how real-time should be done. Primus provides a common low level interface to communicate in real-time using various real-time frameworks.If you deploy your application behind a reverse proxy (Nginx, HAProxy, etc.) you might need to add WebSocket specific settings to its configuration files. If you intend to use WebSockets, please ensure that these settings have been added. There are some example configuration files available in the observing/balancerbattle repository.




spark-py-notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language. If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if your are interested in being introduced to some basic Data Science Engineering, you might find these series of tutorials interesting. There we explain different concepts and applications using Python and R.

kafka-storm-starter - Code examples that show to integrate Apache Kafka 0

  •    Scala

Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark 1.1+ while using Apache Avro as the data serialization format. Take a look at the Kafka Streams code examples at https://github.com/confluentinc/examples.


js-spark-md5 - Lightning fast normal and incremental md5 for javascript

  •    Javascript

SparkMD5 is a fast md5 implementation of the MD5 algorithm. This script is based in the JKM md5 library which is the fastest algorithm around. This is most suitable for browser usage, because nodejs version might be faster. NOTE: Please disable Firebug while performing the test! Firebug consumes a lot of memory and CPU and slows the test by a great margin.

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

  •    Scala

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is, an order of magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.

sparkling-water - Sparkling Water provides H2O functionality inside Spark cluster

  •    Scala

Are you looking for RSparkling? It's README is available here. The Sparkling Water is developed in multiple parallel branches. Each branch corresponds to a Spark major release (e.g., branch rel-2.3 provides implementation of Sparkling Water for Spark 2.3).

spark-jobserver - REST job server for Apache Spark

  •    Scala

spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts. This repo contains the complete Spark job server project, including unit tests and deploy scripts. It was originally started at Ooyala, but this is now the main development repo. Other useful links: Troubleshooting, cluster, YARN client, YARN on EMR, Mesos, JMX tips.

benchm-ml - A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc

  •    R

This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality i.e. not very sparse) and no missing data, perhaps the most common problem in business applications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is of n x p, n is varied as 10K, 100K, 1M, 10M, while p is ~1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular type of data structure/size (the largest) stems from this author's interest in some particular business applications. Note: While a large part of this benchmark was done in Spring 2015 reflecting the state of ML implementations at that time, this repo is being updated if I see significant changes in implementations or new implementations have become widely available (e.g. lightgbm). Also, please find a summary of the progress and learnings from this benchmark at the end of this repo.

pipeline - PipelineAI: Real-Time Enterprise AI Platform

  •    HTML

Each model is built into a separate Docker image with the appropriate Python, C++, and Java/Scala Runtime Libraries for training or prediction. Use the same Docker Image from Local Laptop to Production to avoid dependency surprises.

TransmogrifAI - TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning

  •    Scala

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time. Skip to Quick Start and Documentation.

elephas - Distributed Deep learning with Keras & Spark

  •    Python

Schematically, elephas works as follows. Elephas brings deep learning with Keras to Spark. Elephas intends to keep the simplicity and high usability of Keras, thereby allowing for fast prototyping of distributed models, which can be run on massive data sets. For an introductory example, see the following iPython notebook.