Lightning-fast cluster computing with Apache Spark™ and Apache Cassandra®. This library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications.
cassandra spark cassandra-client cassandra-driver cassandra-library

TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from deep learning framework TensorFlow and big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.
tensorflow spark yahoo machine-learning cluster featured

Primus, the creator god of the transformers, is now also known as a universal wrapper for real-time frameworks. There are a lot of real-time frameworks available for Node.js, and they all have different opinions on how real-time should be done. Primus provides a common low-level interface to communicate in real time using various real-time frameworks. If you deploy your application behind a reverse proxy (Nginx, HAProxy, etc.), you might need to add WebSocket-specific settings to its configuration files. If you intend to use WebSockets, please ensure that these settings have been added. There are some example configuration files available in the observing/balancerbattle repository.
real-time websocket framework sockjs browserchannel polling http nodejs node abstraction engine.io comet streaming pubsub pub sub ajax xhr faye io primus prumus realtime socket socket.io sockets spark transformer transformers websockets ws uws

It is the generic golden program for deep learning with TensorFlow. Following are the supported features.
tensorflow tfrecords libsvm csv deep-learning machine-learning mlp cnn lstm classifier recommendation-system cpp spark grpc android maven

IPython Notebook(s) demonstrating deep learning functionality. IPython Notebook(s) demonstrating scikit-learn functionality.
machine-learning deep-learning data-science big-data aws tensorflow theano caffe scikit-learn kaggle spark mapreduce hadoop matplotlib pandas numpy scipy keras

dev-setup is geared to be more of an organized reference of various developer tools. You're not meant to install everything.
mac vim sublime-text bash iterm2 spark aws cloud android-development cli git mysql postgresql mongodb redis elasticsearch nodejs

This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the Python language. If your language is R rather than Python, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to some basic Data Science Engineering, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R.
spark pyspark data-analysis mllib ipython-notebook notebook ipython data-science machine-learning big-data bigdata

Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark 1.1+ while using Apache Avro as the data serialization format. Take a look at the Kafka Streams code examples at https://github.com/confluentinc/examples.
apache-kafka kafka apache-storm storm spark apache-spark integration avro apache-avro

SparkMD5 is a fast implementation of the MD5 algorithm. This script is based on the JKM md5 library, which is the fastest algorithm around. It is most suitable for browser usage, because the nodejs version might be faster. NOTE: Please disable Firebug while performing the test! Firebug consumes a lot of memory and CPU and slows the test by a great margin.
md5 fast spark incremental

Learn and understand Docker technologies, with real DevOps practice!
docker book cloud-computing container kubernetes swarm mesos spark devops

Apache Spark is a general-purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, it can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, the entire table must be streamed into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low-latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is an order-of-magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.
snappydata spark memory-database analytics stream transaction scale

Are you looking for RSparkling? Its README is available here. Sparkling Water is developed in multiple parallel branches. Each branch corresponds to a Spark major release (e.g., branch rel-2.3 provides the implementation of Sparkling Water for Spark 2.3).
h2o spark machine-learning integration pysparkling rsparkling api devel

A distributed deep learning library for Apache Spark.
deep-learning spark neural-network big-data hadoop keras ai

Kubernetes Chinese Guide / Cloud Native Application Architecture Practice Handbook - https://jimmysong.io/kubernetes-handbook
kubernetes docker gitbook containers pdf cloud-computing cloud-native microservices service-mesh spark big-data faas serverless handbook

spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts. This repo contains the complete Spark job server project, including unit tests and deploy scripts. It was originally started at Ooyala, but this is now the main development repo. Other useful links: Troubleshooting, cluster, YARN client, YARN on EMR, Mesos, JMX tips.
spark rest-api spark-jobserver

This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data, perhaps the most common problem in business applications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is n x p, n is varied as 10K, 100K, 1M, 10M, while p is ~1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular type of data structure/size (the largest) stems from this author's interest in some particular business applications. Note: While a large part of this benchmark was done in Spring 2015, reflecting the state of ML implementations at that time, this repo is updated when I see significant changes in implementations or when new implementations become widely available (e.g. lightgbm). Also, please find a summary of the progress and learnings from this benchmark at the end of this repo.
machine-learning data-science r gradient-boosting-machine random-forest deep-learning xgboost h2o spark

Each model is built into a separate Docker image with the appropriate Python, C++, and Java/Scala Runtime Libraries for training or prediction. Use the same Docker Image from Local Laptop to Production to avoid dependency surprises.
machine-learning artificial-intelligence tensorflow kubernetes elasticsearch cassandra ipython spark kafka netflixoss presto airflow pipeline jupyter-notebook zeppelin docker redis neural-network gpu microservices

Coolplay Spark: Spark source code analysis, Spark libraries, and more
spark spark-streaming structured-streaming sparkcore apache-spark

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time. Skip to Quick Start and Documentation.
ml automl transformations estimators dsl pipelines machine-learning salesforce einstein features feature-engineering spark sparkml ai automated-machine-learning transmogrification transmogrify structured-data transformers

This project is no longer actively maintained; please see Seldon Core. Seldon Server is a machine learning platform that helps your data science team deploy models into production.
machine-learning deep-learning deployment kubernetes docker microservices spark kafka kafka-streams tensorflow cloud aws gcp azure seldon recommender-system recommendation-engine prediction