We have a collection of more than 1 million open source products, ranging from enterprise products to
small libraries, on all platforms. We aggregate information from all open source repositories.
Search and find the best for your needs. Check out the projects section.
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark 1.1+ while using Apache Avro as the data serialization format. Take a look at the Kafka Streams code examples at https://github.com/confluentinc/examples.
The Oryx open source project provides infrastructure for lambda-architecture applications on top of Spark, Spark Streaming and Kafka. On top of this, it provides further support for real-time, large-scale machine learning, and end-to-end applications of this support for common machine learning use cases, like recommendations, clustering, classification and regression.
.NET for Apache Spark provides high performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and SparkSQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data. .NET for Apache Spark is compliant with .NET Standard - a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.
lakeFS is an open source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, atomic and versioned data lake operations - from complex ETL jobs to data science and analytics.
Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in the languages supported by the .NET Framework, like C# or F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.
If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details). The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.
For the first time I’m using AsciiDoc to write a doc that is ultimately supposed to become the book about Apache Spark. Along the way, I’m also aiming at mastering the git(hub) flow to write the book as described in Living the Future of Technical Writing (with pull requests for chapters, action items to show progress of each branch and such).
A curated list of awesome Apache Spark packages and resources. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).
Distributed Deep Learning with Apache Spark and Keras. Distributed Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer could be implemented with ease, thus enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of ensembles and models using data parallel methods.
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data related infrastructure. Wirbelsturm's goal is to make tasks such as "I want to deploy a multi-node Storm cluster" simple, easy, and fun.
sparkle [spär′kəl]: a library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See this blog post for the details. There is experimental support for Bazel. This mechanism doesn't require executing sparkle package.
Flintrock is a command-line tool for launching Apache Spark clusters. Though Flintrock hasn't made a 1.0 release yet, it's fairly stable. Expect some minor but nonetheless backwards-incompatible changes as Flintrock reaches formal stability via a 1.0 release.
Optimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark (PySpark). You can go to the 10 minutes to Optimus notebook, where you can find the basics to get started.
Like my work? I am Principal Consultant at Data Syndrome, a consultancy offering assistance and training with building full-stack analytics products, applications and systems. Find us on the web at datasyndrome.com. There is now a video course using code from chapter 8, Realtime Predictive Analytics with Kafka, PySpark, Spark MLlib and Spark Streaming. Check it out now at datasyndrome.com/video.