Spark - Fast Cluster Computing


Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

http://spark.incubator.apache.org/
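
As a quick illustration of the in-memory computing described above, here is a minimal sketch in Scala; the log path and the "ERROR"/"timeout" filters are illustrative, not part of any particular dataset.

    // Minimal sketch of Spark's in-memory computing: cache an RDD so repeated
    // queries reuse the in-memory copy instead of re-reading from disk.
    import org.apache.spark.{SparkConf, SparkContext}

    object CacheExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("CacheExample").setMaster("local[*]") // or a cluster master URL
        val sc = new SparkContext(conf)
        val errors = sc.textFile("hdfs:///logs/app.log")   // illustrative path
          .filter(_.contains("ERROR"))
          .cache()                                          // keep the filtered data in memory
        println(errors.count())                             // first action reads from disk, then caches
        println(errors.filter(_.contains("timeout")).count()) // served from the in-memory copy
        sc.stop()
      }
    }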

Related Projects

StratoSphere - Cloud Computing Framework for Big Data Analytics


The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises an extensible higher-level language (Meteor) for quickly composing queries for common and recurring use cases, a parallel programming model (PACT, an extension of MapReduce) for running user-defined operations, and an efficient massively parallel runtime (Nephele) for fault-tolerant execution of acyclic data flows.

Hadoop Common


Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. Hadoop Common supports the other Hadoop subprojects.

snappydata - SnappyData: OLTP + OLAP Database built on Apache Spark


SnappyData is a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) in a single integrated cluster. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire XD (as an in-memory transactional store with scale-out SQL semantics).

spark-cluster-deployment - Automates Spark standalone cluster tasks with Puppet and Fabric.


Apache Spark is a distributed computing framework that interacts with HDFS and makes heavy use of in-memory caching. Spark 1.0.0 can be deployed to traditional cloud and job-management services such as EC2, Mesos, or YARN. Further, Spark's standalone cluster mode lets Spark run on a set of servers without installing any other job-management service. However, configuring and submitting applications to a Spark 1.0.0 standalone cluster currently requires files to be synchronized across the entire cluster, including the Spark installation directory. This project uses Fabric and Puppet to further automate Spark standalone cluster administration. The Puppet scripts are MIT-licensed and come from stefanvanwouw/puppet-spark and wikimedia/puppet-cdh4.
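
For context, a minimal sketch of an application pointed at a standalone cluster; the master URL spark://master-host:7077 is illustrative and would correspond to whatever master the Puppet scripts set up.

    // Minimal sketch of connecting to a Spark standalone cluster (master URL is illustrative).
    import org.apache.spark.{SparkConf, SparkContext}

    object StandaloneApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("StandaloneApp")
          .setMaster("spark://master-host:7077") // standalone master, rather than YARN or Mesos
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 1000000).sum())
        sc.stop()
      }
    }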

Apache REEF - a stdlib for Big Data


Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.



GeoMesa - Suite of tools for working with big geo-spatial data in a distributed fashion


GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide Accumulo with as much spatial querying and data-manipulation capability as PostGIS provides for Postgres.

snappy-spark - Apache Spark with SnappyData extensions


Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
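
A minimal sketch of the DataFrame and Spark SQL APIs mentioned above, using a Spark 2.x-style SparkSession; the example data is made up.

    // Minimal Spark SQL / DataFrame sketch (example data is made up).
    import org.apache.spark.sql.SparkSession

    object SqlExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SqlExample").master("local[*]").getOrCreate()
        import spark.implicits._
        val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
        people.createOrReplaceTempView("people")
        spark.sql("SELECT name FROM people WHERE age > 30").show()
        spark.stop()
      }
    }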

timely-security-analytics - Demo code for the Timely Security Analytics and Analysis 2015 Re:Invent presentation


This page will be updated with a link to the video of the talk when it is available. Note that these commands will take some time to run, as they load your CloudTrail data from S3 and store it in memory on the Spark cluster. Run the sample query again and you will see the speed-up that in-memory caching provides; you can then run any SQL query you want over the data.
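
A hedged sketch of the workflow the description alludes to; the S3 path, table name, and column are illustrative rather than taken from the project, and it assumes the CloudTrail records have already been flattened into one JSON object per line.

    // Illustrative sketch: load CloudTrail JSON from S3, cache it, and query it with Spark SQL.
    import org.apache.spark.sql.SparkSession

    object CloudTrailQueries {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("CloudTrailQueries").getOrCreate()
        val logs = spark.read.json("s3a://your-cloudtrail-bucket/AWSLogs/") // illustrative bucket/path
        logs.createOrReplaceTempView("cloudtrail")
        spark.sql("CACHE TABLE cloudtrail") // later queries hit the in-memory copy
        spark.sql("SELECT eventName, COUNT(*) AS calls FROM cloudtrail GROUP BY eventName ORDER BY calls DESC").show()
        spark.stop()
      }
    }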

TensorFlowOnSpark - TensorFlowOnSpark brings TensorFlow programs onto Apache Spark clusters


TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the deep learning framework TensorFlow and the big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on the Hadoop clusters in Yahoo's private cloud.

real-time-analytics-spark-streaming - A solution describing a data-processing design pattern for streaming data through Kinesis and Spark Streaming in real time


We have published an updated solution at https://aws.amazon.com/answers/big-data/real-time-analytics-spark-streaming/, which makes the solution easier to deploy. The readme has been updated with the latest installation instructions; please refer to the solution page for additional information. We will maintain this repo with the latest template and code. The solution describes a data-processing design pattern for streaming data through Kinesis and Spark Streaming in real time.

Apache Tajo - A big data warehouse system on Hadoop


Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract, transform, load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources.

Redisson - Redis based In-Memory Data Grid for Java


Redisson - distributed Java objects and services (Set, Multimap, SortedSet, Map, List, Queue, BlockingQueue, Deque, BlockingDeque, Semaphore, Lock, AtomicLong, Map Reduce, Publish / Subscribe, Bloom filter, Spring Cache, Executor service, Tomcat Session Manager, Scheduler service, JCache API) on top of Redis server. Rich Redis client.
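
As a small sketch of what those distributed objects look like in code (shown in Scala, since Redisson is a plain JVM library); it assumes a Redis server on localhost:6379 and the redisson artifact on the classpath, and the map name is illustrative.

    // Minimal Redisson sketch: a distributed Map backed by a Redis hash.
    import org.redisson.Redisson
    import org.redisson.api.{RMap, RedissonClient}

    object RedissonExample {
      def main(args: Array[String]): Unit = {
        val client: RedissonClient = Redisson.create() // default config: redis://127.0.0.1:6379
        val map: RMap[String, String] = client.getMap[String, String]("example-map")
        map.put("greeting", "hello")                   // visible to every JVM sharing the Redis server
        println(map.get("greeting"))
        client.shutdown()
      }
    }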

Shark - Hive on Spark


Shark is an open-source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users, running Hive queries up to 100x faster in memory, or 10x on disk. Shark is a large-scale data warehouse system for Spark designed to be compatible with Apache Hive.

spark-parquet-thrift-example - Example Spark project using Parquet as a columnar store with Thrift objects


Apache Spark is a distributed computing framework that interacts with HDFS and makes heavy use of in-memory caching. Modern datasets contain hundreds or thousands of columns and are too large to cache entirely in Spark's memory, so Spark has to resort to paging to disk. The disk-paging penalty can be lessened or removed if the Spark application only interacts with a subset of the columns in the overall database, by using a columnar store such as Parquet, which loads only the specified columns of data into a Spark RDD. Matt Massie's example uses Parquet with Avro for data serialization and filters loading based on an equality predicate, but does not show how to load only a subset of columns. This project is a complete Scala/sbt project that uses Thrift for data serialization and shows how to load columnar subsets.
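
A hedged sketch of the column-subset idea using the DataFrame API; the path and column names are illustrative, and the project itself demonstrates the Thrift-based variant rather than this exact code.

    // Illustrative sketch: read only two columns from a Parquet dataset.
    import org.apache.spark.sql.SparkSession

    object ParquetColumnSubset {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ParquetColumnSubset").getOrCreate()
        // Parquet is columnar, so only the selected columns are read from disk.
        val subset = spark.read.parquet("hdfs:///data/records.parquet").select("userId", "eventTime")
        println(subset.count())
        spark.stop()
      }
    }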

spindle - Next-generation web analytics processing with Scala, Spark, and Parquet.


Spindle is Brandon Amos' 2014 summer internship project with Adobe Research and is not under active development. Analytics platforms such as Adobe Analytics are growing to process petabytes of data in real time. Delivering responsive interfaces that query this amount of data is difficult, and there are many distributed data-processing technologies, such as Hadoop MapReduce, Apache Spark, Apache Drill, and Cloudera Impala, for building low-latency query systems.

CaffeOnSpark


CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning using their existing LMDB data files and a minimally adjusted network configuration.

Hypertable - A high performance, scalable, distributed storage and processing system for structured


Hypertable is based on Google's Bigtable design, a proven scalable design that powers hundreds of Google services. Many of the current scalable NoSQL database offerings are based on a hash-table design, which means the data they manage is not kept physically ordered. Hypertable keeps data physically sorted by a primary key, which makes it well suited for analytics.

elasticsearch-hadoop - :elephant: Elasticsearch real-time search and analytics natively integrated with Hadoop


Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark, and Apache Storm. See the project page and documentation for detailed information.
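
A hedged sketch of the Spark integration; the index name, node address, and sample documents are illustrative, and saveToEs/esRDD come from the implicits in the elasticsearch-spark module.

    // Illustrative sketch: write an RDD to Elasticsearch and read it back.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._

    object EsExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("EsExample").set("es.nodes", "localhost:9200")
        val sc = new SparkContext(conf)
        sc.makeRDD(Seq(Map("title" -> "spark"), Map("title" -> "hadoop")))
          .saveToEs("demo/docs")                        // index/type target
        sc.esRDD("demo/docs").take(10).foreach(println) // RDD of (id, document) pairs
        sc.stop()
      }
    }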

apex-core - Mirror of Apache Apex core


Apache Apex is a unified platform for big data stream and batch processing. Use cases include ingestion, ETL, real-time analytics, alerts, and real-time actions. Apex is a Hadoop-native YARN implementation and uses HDFS by default. It simplifies development and productization of Hadoop applications by reducing time to market. Key features include enterprise-grade operability with fault tolerance, state management, event-processing guarantees, no data loss, in-memory performance and scalability, and native window support. Please visit the documentation section for details.

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop


Apache Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.