Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark 1.1+ while using Apache Avro as the data serialization format. Take a look at the Kafka Streams code examples at https://github.com/confluentinc/examples.
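As a flavor of the Spark side of such an integration, here is a minimal, hedged sketch in PySpark. It assumes a Spark release that ships the pyspark.streaming.kafka receiver API (Spark 1.3 through 2.4; the project itself targets Spark 1.1+, whose Kafka integration predates the Python API), and the topic name, Avro schema, and ZooKeeper address are made-up examples, not taken from the repository.

```python
# Hedged sketch: consuming Avro-encoded Kafka messages from Spark Streaming.
# Assumes the pyspark.streaming.kafka module (shipped with Spark 1.3-2.4);
# the topic, schema, and addresses are illustrative.
import io

import avro.io
import avro.schema
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Spelled avro.schema.Parse in some avro-python3 releases.
SCHEMA = avro.schema.parse("""
    {"type": "record", "name": "Tweet",
     "fields": [{"name": "text", "type": "string"}]}
""")

def decode_avro(raw_bytes):
    """Deserialize one Avro-encoded Kafka message into a Python dict."""
    if raw_bytes is None:
        return None
    reader = avro.io.DatumReader(SCHEMA)
    return reader.read(avro.io.BinaryDecoder(io.BytesIO(raw_bytes)))

sc = SparkContext(appName="kafka-spark-avro-demo")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# One receiver on the hypothetical "tweets" topic; a custom valueDecoder
# keeps the payload as raw bytes for Avro instead of UTF-8 decoding it.
stream = KafkaUtils.createStream(
    ssc, "localhost:2181", "demo-consumer-group", {"tweets": 1},
    valueDecoder=decode_avro)

stream.map(lambda kv: kv[1]["text"]).pprint()

ssc.start()
ssc.awaitTermination()
```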
apache-kafka kafka apache-storm storm spark apache-spark integration avro apache-avro

The Oryx open source project provides infrastructure for lambda-architecture applications on top of Spark, Spark Streaming, and Kafka. On top of this, it provides further support for real-time, large-scale machine learning, and end-to-end implementations of common machine learning use cases such as recommendations, clustering, classification, and regression.
lambda lambda-architecture oryx apache-spark machine-learning kafka classification clustering

酷玩 Spark (Coolplay Spark): Apache Spark source code analysis, Spark libraries, and more.
spark spark-streaming structured-streaming sparkcore apache-spark

.NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and Spark SQL aspects of Apache Spark for working with structured data, and Spark Structured Streaming for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.
spark analytics bigdata spark-streaming spark-sql machine-learning fsharp dotnet-core dotnet-standard streaming apache-spark tpcds tpch azure hdinsight databricks emr microsoft

MLflow is an open source platform for the machine learning lifecycle. MLflow requires conda to be on the PATH for the projects feature. Nightly snapshots of MLflow master are also available.
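For orientation, the heart of MLflow's tracking API is only a few calls; a minimal sketch, where the experiment name and logged values are made up:

```python
# Minimal MLflow tracking sketch; the experiment name, parameter, and
# metric values are illustrative.
import mlflow

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)   # record a hyperparameter
    mlflow.log_metric("rmse", 0.78)  # record an evaluation result
```

Runs logged this way can then be browsed and compared in the MLflow UI.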
machine-learning ai apache-spark ml model-management mlflow

The Spark Notebook is the open source notebook aimed at enterprise environments, providing Data Scientists and Data Engineers with an interactive web-based editor that can combine Scala code, SQL queries, Markup and JavaScript in a collaborative manner to explore, analyse and learn from massive data sets. The Spark Notebook allows performing reproducible analysis with Scala, Apache Spark and the Big Data ecosystem.
data-science reactive spark apache-spark notebook

lakeFS is an open source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, atomic and versioned data lake operations - from complex ETL jobs to data science and analytics.
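Because lakeFS exposes an S3-compatible endpoint, one common pattern is to point Spark's s3a filesystem at it and address data as repository/branch/path. A minimal sketch, where the endpoint, credentials, repository name ("example-repo"), and branch ("main") are all assumptions:

```python
# Hedged sketch: reading a branch-versioned path through lakeFS's
# S3-compatible gateway from PySpark. Endpoint, credentials, repository,
# and branch names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakefs-demo")
    # Point the s3a filesystem at the lakeFS gateway instead of AWS S3.
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

# Paths are addressed as s3a://<repository>/<branch>/<object-key>, so the
# same job can be pointed at an experiment branch instead of main.
df = spark.read.parquet("s3a://example-repo/main/events/")
df.show()
```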
apache-spark aws-s3 google-cloud-storage data-engineering data-lake object-storage datalake hadoop-filesystem data-quality data-versioning azure-blob-storage apache-sparksql git-for-data lakefs datalakes

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in languages supported by the .NET framework, such as C# and F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.
spark apache-spark rdd dataframe dstream dataset streaming mobius kafka-streaming spark-streaming fsharp bigdata mapreduce eventhubs near-real-time

If you use the RStudio IDE, you should also download the latest preview release of the IDE, which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details). The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.
r rstats apache-spark machine-learning r-package dplyr sparklyr dbi

For the first time I'm using AsciiDoc to write a doc that is ultimately supposed to become the book about Apache Spark. Along the way, I'm also aiming to master the git(hub) flow of writing a book, as described in Living the Future of Technical Writing (with pull requests for chapters, action items to show the progress of each branch, and such).
apache-spark spark shufflelikepro book

A curated list of awesome Apache Spark packages and resources. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance (Wikipedia 2017).
apache-spark pyspark awesome sparkr

Distributed Deep Learning with Apache Spark and Keras. Distributed Keras is a distributed deep learning framework built on top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer can be implemented with ease, enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of ensembles and models using data-parallel methods.
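To make "data-parallel methods" concrete, here is a minimal, library-free sketch of synchronous data-parallel optimization: each worker computes a gradient on its own data shard, and the shard gradients are averaged before each parameter update. The model, data, and learning rate are all made up for illustration; the actual Distributed Keras API differs.

```python
# Illustrative sketch of synchronous data-parallel optimization (the family
# of methods Distributed Keras builds on). In a real cluster, each shard's
# gradient would be computed on a different Spark executor.
import numpy as np

def gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
X = rng.normal(size=(1000, 5))
y = X @ true_w + rng.normal(scale=0.1, size=1000)

num_workers, lr = 4, 0.1
w = np.zeros(5)
shards = list(zip(np.array_split(X, num_workers),
                  np.array_split(y, num_workers)))

for step in range(100):
    # Each worker computes a gradient on its shard; updates use the mean.
    grads = [gradient(w, Xs, ys) for Xs, ys in shards]
    w -= lr * np.mean(grads, axis=0)

print(w)  # converges to roughly the true coefficients
```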
machine-learning deep-learning apache-spark data-parallelism distributed-optimizers keras optimization-algorithms tensorflow data-science hadoop

Wirbelsturm is a Vagrant- and Puppet-based tool to perform 1-click local and remote deployments, with a focus on big-data-related infrastructure. Wirbelsturm's goal is to make tasks such as "I want to deploy a multi-node Storm cluster" simple, easy, and fun.
vagrant puppet kafka apache-kafka storm apache-storm spark apache-spark

sparkle [spär′kəl]: a library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See this blog post for the details. There is experimental support for Bazel; this mechanism doesn't require executing sparkle package.
apache-spark spark haskell analytics

Flintrock is a command-line tool for launching Apache Spark clusters. Though Flintrock hasn't made a 1.0 release yet, it's fairly stable. Expect some minor but nonetheless backwards-incompatible changes as Flintrock reaches formal stability via a 1.0 release.
apache-spark ec2 apache-spark-cluster orchestration spark-ec2

Optimus is the missing framework to profile, clean, process, and do ML on data in a distributed fashion using Apache Spark (PySpark). You can go to the 10 minutes to Optimus notebook, where you can find the basics to start working.
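A hedged sketch of the kind of cleaning workflow Optimus aims at; the input file and column name are made up, and the method names follow the Optimus 2.x style but are assumptions that may vary between releases:

```python
# Hedged sketch of an Optimus cleaning workflow in the Optimus 2.x style.
# The input file ("users.csv") and column name ("name") are hypothetical,
# and the exact API surface differs between Optimus releases.
from optimus import Optimus

op = Optimus(master="local[*]")  # creates/attaches a Spark session
df = op.load.csv("users.csv")    # hypothetical input file

df = df.cols.trim("*").cols.lower("name")  # normalize whitespace and case
df.table()                                 # pretty-print a sample
```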
spark pyspark data-wrangling bigdata big-data-cleaning data-science cleansing data-cleansing data-cleaner apache-spark data-transformation

Like my work? I am Principal Consultant at Data Syndrome, a consultancy offering assistance and training with building full-stack analytics products, applications, and systems. Find us on the web at datasyndrome.com. There is now a video course using code from chapter 8, Realtime Predictive Analytics with Kafka, PySpark, Spark MLlib and Spark Streaming. Check it out now at datasyndrome.com/video.
data-syndrome data data-science analytics apache-spark apache-kafka kafka spark predictive-analytics machine-learning machine-learning-algorithms airflow python-3 python3 amazon-ec2 agile-data agile-data-science vagrant amazon-web-services

A recommender system for discovering GitHub repos.
recommender-system machine-learning apache-spark feature-engineering elasticsearch

Libraries to connect (and demonstrate) Azure Event Hubs with Apache Spark.
spark spark-streaming azure eventhubs real-time streaming continuous apache apache-spark microsoft

Azure Event Hubs Connector for Spark Streaming Applications.
spark spark-streaming azure eventhubs real-time streaming continuous apache apache-spark microsoft