kafka-storm-starter - Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark 1.1+

  •    Scala

Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark 1.1+ while using Apache Avro as the data serialization format. Take a look at the Kafka Streams code examples at https://github.com/confluentinc/examples.

Oryx 2 - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

  •    Java

The Oryx open source project provides infrastructure for lambda-architecture applications on top of Spark, Spark Streaming and Kafka. On top of this infrastructure, it provides further support for real-time, large-scale machine learning, and end-to-end applications of this support for common machine learning use cases such as recommendations, clustering, classification and regression.

spark - .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

  •    CSharp

.NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.




mlflow - Open source platform for the machine learning lifecycle

  •    Python

MLflow requires conda to be on the PATH for the Projects feature. Nightly snapshots of MLflow master are also available.

spark-notebook - Interactive and Reactive Data Science using Scala and Spark.

  •    Javascript

The Spark Notebook is the open source notebook aimed at enterprise environments, providing Data Scientists and Data Engineers with an interactive web-based editor that can combine Scala code, SQL queries, Markup and JavaScript in a collaborative manner to explore, analyse and learn from massive data sets. The Spark Notebook allows performing reproducible analysis with Scala, Apache Spark and the Big Data ecosystem.

lakeFS - Git-like capabilities for your object storage

  •    Go

lakeFS is an open source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, atomic and versioned data lake operations - from complex ETL jobs to data science and analytics.

Mobius - C# and F# language binding and extensions to Apache Spark

  •    CSharp

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in the languages supported by the .NET Framework, such as C# or F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.


sparklyr - R interface for Apache Spark

  •    R

If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark. The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.

mastering-apache-spark-book - Mastering Apache Spark 2

For the first time I’m using AsciiDoc to write a doc that is ultimately supposed to become the book about Apache Spark. Along the way, I’m also aiming to master the git(hub) flow for writing the book, as described in Living the Future of Technical Writing (with pull requests for chapters, action items to show the progress of each branch, and such).

awesome-spark - A curated list of awesome Apache Spark packages and resources.

A curated list of awesome Apache Spark packages and resources. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).

dist-keras - Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark

  •    Python

Distributed Deep Learning with Apache Spark and Keras. Distributed Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer can be implemented with ease, enabling a person to focus on research. Several distributed methods are supported, including, but not restricted to, the training of ensembles and of models using data-parallel methods.

wirbelsturm - Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data related infrastructure

  •    Shell

Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data related infrastructure. Wirbelsturm's goal is to make tasks such as "I want to deploy a multi-node Storm cluster" simple, easy, and fun.

sparkle - Haskell on Apache Spark.

  •    Haskell

sparkle [spär′kəl]: a library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See this blog post for the details. There is experimental support for bazel. This mechanism doesn't require executing sparkle package.

flintrock - A command-line tool for launching Apache Spark clusters.

  •    Python

Flintrock is a command-line tool for launching Apache Spark clusters. Though Flintrock hasn't made a 1.0 release yet, it's fairly stable. Expect some minor but nonetheless backwards-incompatible changes as Flintrock reaches formal stability via a 1.0 release.

Optimus - Agile Data Science Workflows made easy with Python and Spark.

  •    Python

Optimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark (PySpark). The "10 minutes to Optimus" notebook covers the basics you need to start working.

Agile_Data_Code_2 - Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

  •    Jupyter

Like my work? I am Principal Consultant at Data Syndrome, a consultancy offering assistance and training with building full-stack analytics products, applications and systems. Find us on the web at datasyndrome.com. There is now a video course using code from chapter 8, Realtime Predictive Analytics with Kafka, PySpark, Spark MLlib and Spark Streaming. Check it out now at datasyndrome.com/video.





