cdap - An open source framework for building data analytic applications.

Data Application Platform for Hadoop

https://docs.cdap.io
https://github.com/caskdata/cdap

Dependencies:

org.slf4j:slf4j-api:1.7.5
org.slf4j:log4j-over-slf4j:1.7.5
org.slf4j:jcl-over-slf4j:1.7.5
org.slf4j:jul-to-slf4j:1.7.5
co.cask.common:common-cli:0.8.0
javax.ws.rs:javax.ws.rs-api:2.0
org.apache.twill:twill-api:0.13.0
co.cask.http:netty-http:1.1.0-SNAPSHOT
org.apache.tephra:tephra-api:0.15.0-incubating
org.apache.tephra:tephra-core:0.15.0-incubating
org.apache.tephra:tephra-hbase-compat-0.96:0.15.0-incubating
org.apache.tephra:tephra-hbase-compat-0.98:0.15.0-incubating
org.apache.tephra:tephra-hbase-compat-1.0-cdh:0.15.0-incubating
org.apache.tephra:tephra-hbase-compat-1.0:0.15.0-incubating
org.apache.tephra:tephra-hbase-compat-1.1:0.15.0-incubating
com.google.guava:guava:13.0.1
com.google.inject:guice:4.0
com.google.inject.extensions:guice-assistedinject:4.0
com.google.inject.extensions:guice-multibindings:4.0
com.google.inject.extensions:guice-servlet:4.0
org.jboss.resteasy:resteasy-servlet-initializer:3.0.8.Final
org.jboss.resteasy:resteasy-guice:3.0.8.Final
org.xerial.snappy:snappy-java:1.1.1.7
jline:jline:2.12
org.mockito:mockito-core:1.9.5
com.googlecode.concurrent-trees:concurrent-trees:2.4.0
com.google.code.findbugs:jsr305:2.0.1
com.google.code.gson:gson:2.2.4
com.jcraft:jsch:0.1.54
org.apache.avro:avro:1.6.2
org.apache.avro:avro-ipc:1.6.2
org.apache.avro:avro-mapred:1.6.2
org.apache.flume:flume-ng-sdk:1.2.0
org.apache.flume:flume-ng-core:1.2.0
org.apache.zookeeper:zookeeper:3.4.5
io.netty:netty-buffer:4.1.16.Final
io.netty:netty-codec-http:4.1.16.Final
io.netty:netty-all:4.1.16.Final
ch.qos.logback:logback-core:1.0.9
ch.qos.logback:logback-classic:1.0.9
org.ow2.asm:asm-all:5.0.3
org.apache.kafka:kafka_2.10:0.8.2.2
org.iq80.leveldb:leveldb:0.6
commons-codec:commons-codec:1.6
commons-cli:commons-cli:1.2
org.apache.twill:twill-common:0.13.0
org.apache.twill:twill-core:0.13.0
org.apache.twill:twill-discovery-api:0.13.0
org.apache.twill:twill-discovery-core:0.13.0
org.apache.twill:twill-yarn:0.13.0
org.apache.twill:twill-zookeeper:0.13.0
org.apache.thrift:libthrift:0.9.3
org.apache.hadoop:hadoop-common:2.3.0
org.apache.hadoop:hadoop-yarn-api:2.3.0
org.apache.hadoop:hadoop-yarn-client:2.3.0
org.apache.hadoop:hadoop-yarn-common:2.3.0
org.apache.hadoop:hadoop-hdfs:2.3.0
org.apache.hadoop:hadoop-mapreduce-client-app:2.3.0
org.apache.hadoop:hadoop-mapreduce-client-core:2.3.0
org.apache.hadoop:hadoop-mapreduce-client-common:2.3.0
org.apache.hbase:hbase-common:0.98.6.1-hadoop2
org.apache.hbase:hbase-client:0.98.6.1-hadoop2
org.apache.hbase:hbase-protocol:0.98.6.1-hadoop2
org.apache.hbase:hbase-server:0.98.6.1-hadoop2
io.thekraken:grok:0.1.0
org.apache.shiro:shiro-core:1.2.1
org.apache.shiro:shiro-guice:1.2.1
mysql:mysql-connector-java:5.1.21
org.mortbay.jetty:jetty:6.1.22
org.mortbay.jetty:jetty-management:6.1.22
org.quartz-scheduler:quartz:2.2.0
org.quartz-scheduler:quartz-jobs:2.2.0
com.ning:async-http-client:1.7.18
org.eclipse.jetty:jetty-server:8.1.15.v20140411
org.eclipse.jetty:jetty-security:8.1.15.v20140411
org.eclipse.jetty:jetty-util:8.1.15.v20140411
org.eclipse.jetty:jetty-jaspi:8.1.15.v20140411
org.eclipse.jetty:jetty-plus:8.1.15.v20140411
org.apache.geronimo.components:geronimo-jaspi:2.0.0
org.apache.hive:hive-jdbc:1.2.1
org.apache.hive:hive-metastore:1.2.1
org.apache.hive:hive-service:1.2.1
org.apache.hive:hive-exec:1.2.1
javax.servlet:javax.servlet-api:3.0.1
it.unimi.dsi:fastutil:6.5.6
org.apache.spark:spark-core_2.10:1.6.1
org.apache.spark:spark-core_2.11:2.1.3
org.apache.spark:spark-sql_2.10:1.6.1
org.apache.spark:spark-sql_2.11:2.1.3
org.apache.spark:spark-streaming_2.10:1.6.1
org.apache.spark:spark-streaming_2.11:2.1.3
org.apache.spark:spark-mllib_2.10:1.6.1
org.apache.spark:spark-mllib_2.11:2.1.3
org.apache.spark:spark-repl_2.10:1.6.1
org.apache.spark:spark-repl_2.11:2.1.3
co.cask.cdap:cdap-authentication-client:1.2.0
co.cask.common:common-http:0.11.0
co.cask.common:common-io:0.11.0
org.apache.hadoop:hadoop-mapreduce-client-jobclient:2.3.0
org.twitter4j:twitter4j-core:4.0.3
org.twitter4j:twitter4j-stream:4.0.3
javax.mail:mail:1.4.1
org.apache.commons:commons-compress:1.18
org.apache.tez:tez-api:0.8.4

Related Projects

Mobius - C# and F# language binding and extensions to Apache Spark

  •    CSharp

Mobius provides a C# language binding to Apache Spark, enabling the implementation of Spark driver programs and data processing operations in languages supported by the .NET framework, such as C# or F#. For more code samples, refer to the Mobius\examples directory or the Mobius\csharp\Samples directory.

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

  •    Scala

Apache Spark is a general-purpose parallel computation engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, it can also be quite inefficient and expensive: analytic processing requires massive data sets to be repeatedly copied and reformatted to suit Spark, and in many cases it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, the entire table must be streamed into Spark to do the aggregation. Caching within Spark is immutable and can result in stale insights. At SnappyData, we take a very different approach. SnappyData fuses a low-latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten), and all query engine operators are significantly more optimized through better vectorization and code generation. The net effect is an order-of-magnitude performance improvement over native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.
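
As a rough illustration, the sketch below creates a columnar table and queries it from Spark. It assumes SnappyData's documented SnappySession entry point; the table name and schema are illustrative and details may vary by version:

    import org.apache.spark.sql.{SnappySession, SparkSession}

    val spark = SparkSession.builder().appName("SnappyExample").master("local[*]").getOrCreate()
    // SnappySession extends SparkSession and talks to the embedded in-memory store
    val snappy = new SnappySession(spark.sparkContext)

    // A column table uses the same columnar (Tungsten-style) layout Spark uses internally
    snappy.sql("CREATE TABLE quotes (symbol STRING, price DOUBLE) USING column")
    snappy.sql("INSERT INTO quotes VALUES ('AAPL', 170.0), ('AAPL', 171.5)")

    // The aggregation runs against the co-located store; no copy into Spark is needed
    snappy.sql("SELECT symbol, AVG(price) FROM quotes GROUP BY symbol").show()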

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

  •    Java

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
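
For concreteness, here is a minimal word-count pipeline sketch. Beam's SDKs are Java, Python and Go; for consistency with the other sketches on this page, the Java SDK is used from Scala, and the file paths are placeholders:

    import org.apache.beam.sdk.Pipeline
    import org.apache.beam.sdk.io.TextIO
    import org.apache.beam.sdk.options.PipelineOptionsFactory
    import org.apache.beam.sdk.transforms.{Count, FlatMapElements, MapElements, SimpleFunction}
    import org.apache.beam.sdk.values.KV

    object BeamWordCount {
      def main(args: Array[String]): Unit = {
        // The runner (Spark, Flink, Dataflow, direct, ...) is chosen via pipeline options
        val p = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

        p.apply(TextIO.read().from("input.txt"))
          .apply(FlatMapElements.via(new SimpleFunction[String, java.util.List[String]]() {
            override def apply(line: String): java.util.List[String] =
              java.util.Arrays.asList(line.split("\\s+"): _*)
          }))
          .apply(Count.perElement[String]())
          .apply(MapElements.via(new SimpleFunction[KV[String, java.lang.Long], String]() {
            override def apply(kv: KV[String, java.lang.Long]): String =
              kv.getKey + ": " + kv.getValue
          }))
          .apply(TextIO.write().to("wordcounts"))

        p.run().waitUntilFinish()
      }
    }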

spark-movie-lens - An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

  •    Jupyter

This Apache Spark tutorial will guide you step by step through using the MovieLens dataset to build a movie recommender with collaborative filtering, using Spark's Alternating Least Squares implementation. It is organised in two parts. The first is about getting and parsing movie and rating data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, which has also been publicly available since 2014 at Spark Summit. Starting from there, I've made minor modifications to use a larger dataset, added code to store and reload the model for later use, and finally built a web service using Flask.
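
A minimal sketch of the same train/predict/persist workflow, shown here with Spark MLlib's Scala API rather than the tutorial's PySpark; the input path, rating file format and hyperparameters are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    val sc = new SparkContext(new SparkConf().setAppName("MovieALS").setMaster("local[*]"))

    // Hypothetical input: each line of "data/ratings.csv" is "userId,movieId,rating"
    val ratings = sc.textFile("data/ratings.csv").map { line =>
      val fields = line.split(',')
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }

    // Train a matrix factorization model; rank = 10, iterations = 10, lambda = 0.01
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Score one (user, movie) pair, then persist the model for the online service
    println(model.predict(1, 100))
    model.save(sc, "models/movie-als")
    val reloaded = MatrixFactorizationModel.load(sc, "models/movie-als")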

Spark - Cross-platform real-time collaboration client optimized for business and organizations.

  •    Java

Spark is an Open Source, cross-platform IM client optimized for businesses and organizations. It features built-in support for group chat, telephony integration, and strong security. It also offers a great end-user experience with features like in-line spell checking, group chat room bookmarks, and tabbed conversations. Combined with the Openfire server, Spark is the easiest and best alternative to using insecure public IM networks.


spring-hadoop - Spring for Apache Hadoop is a framework for application developers to take advantage of the features of both Hadoop and Spring

  •    Java

The Spring for Apache Hadoop project provides extensions to Spring, Spring Batch, and Spring Integration to build manageable and robust pipeline solutions around Hadoop. Spring for Apache Hadoop extends Spring Batch by providing support for reading from and writing to HDFS, running various types of Hadoop jobs (Java MapReduce, Streaming, Hive, Spark, Pig) and using HBase. An important goal is to provide excellent support for non-Java developers to be productive using Spring Hadoop without having to write any Java code to use the core feature set.

Agile_Data_Code_2 - Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

  •    Jupyter

Like my work? I am Principal Consultant at Data Syndrome, a consultancy offering assistance and training with building full-stack analytics products, applications and systems. Find us on the web at datasyndrome.com. There is now a video course using code from chapter 8, Realtime Predictive Analytics with Kafka, PySpark, Spark MLlib and Spark Streaming. Check it out now at datasyndrome.com/video.

Gimel - PayPal's Big Data Processing Framework

  •    Scala

Gimel provides a unified Data API to access data from any storage system, such as HDFS, GS, Alluxio, HBase, Aerospike, BigQuery, Druid, Elastic, Teradata, Oracle, and MySQL.
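
A sketch of what a read through the unified API looks like. The DataSet entry point and the catalog-style dataset name below follow the shapes in Gimel's README, but the exact call signatures and the "pcatalog.flights" entry are assumptions for illustration:

    import com.paypal.gimel.DataSet
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("GimelExample").enableHiveSupport().getOrCreate()

    // One handle, many storages: the catalog entry decides whether this resolves
    // to HBase, Elastic, Teradata, etc. ("pcatalog.flights" is a hypothetical entry)
    val dataset = DataSet(spark)
    val flights = dataset.read("pcatalog.flights")
    flights.show()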

shc - The Apache Spark - Apache HBase Connector is a library to support Spark accessing HBase table as external data source or sink

  •    Scala

The Apache Spark - Apache HBase Connector is a library that supports Spark in accessing HBase tables as an external data source or sink. With it, users can operate on HBase with Spark SQL at the DataFrame and Dataset level. With DataFrame and Dataset support, the library leverages all the optimization techniques in Catalyst, and achieves data locality, partition pruning, predicate pushdown, Scanning and BulkGet, etc.
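
Usage follows SHC's catalog-driven pattern. The sketch below writes and reads a DataFrame against HBase; the table name, columns and catalog JSON follow the shapes in SHC's README but are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    val spark = SparkSession.builder().appName("SHCExample").getOrCreate()
    import spark.implicits._

    // Catalog mapping DataFrame columns onto an HBase table, row key and column family
    val catalog =
      s"""{
         |"table":{"namespace":"default", "name":"table1"},
         |"rowkey":"key",
         |"columns":{
         |  "col0":{"cf":"rowkey", "col":"key", "type":"string"},
         |  "col1":{"cf":"cf1", "col":"col1", "type":"double"}
         |}
         |}""".stripMargin

    val df = Seq(("row1", 1.0), ("row2", 2.0)).toDF("col0", "col1")

    // Write to HBase, creating the table with 5 regions if it does not exist
    df.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()

    // Read back as a DataFrame; filters are pushed down to HBase where possible
    val fromHBase = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()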

flint - A Time Series Library for Apache Spark

  •    Scala

The ability to analyze time series data at scale is critical for the success of finance and IoT applications based on Spark. Flint is Two Sigma's implementation of highly optimized time series operations in Spark. It performs truly parallel and rich analyses on time series data by taking advantage of the natural ordering in time series data to provide locality-based optimizations. Flint is an open source library for Spark based around the TimeSeriesRDD, a time series aware data structure, and a collection of time series utility and analysis functions that use TimeSeriesRDDs. Unlike DataFrame and Dataset, Flint's TimeSeriesRDDs can leverage the existing ordering properties of datasets at rest and the fact that almost all data manipulations and analysis over these datasets respect their temporal ordering properties. It differs from other time series efforts in Spark in its ability to efficiently compute across panel data or on large scale high frequency data.
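
A brief sketch of the TimeSeriesRDD in use, following the shapes in Flint's README; it assumes priceDF and newsDF are existing DataFrames with a "time" column, and the column names, tolerance and time unit are illustrative:

    import java.util.concurrent.TimeUnit
    import com.twosigma.flint.timeseries.{Summarizers, TimeSeriesRDD}

    // isSorted = true tells Flint to trust the existing temporal ordering of the data
    val priceTS = TimeSeriesRDD.fromDF(dataFrame = priceDF)(isSorted = true, timeUnit = TimeUnit.NANOSECONDS)
    val newsTS  = TimeSeriesRDD.fromDF(dataFrame = newsDF)(isSorted = true, timeUnit = TimeUnit.NANOSECONDS)

    // Temporal left join: pair each price row with the closest news row within one day
    val joined = priceTS.leftJoin(newsTS, tolerance = "1day")

    // Aggregate rows that share a timestamp (one "cycle"), e.g. the mean price per instant
    val meanPerCycle = priceTS.summarizeCycles(Summarizers.mean("price"))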

Oryx 2 - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

  •    Java

The Oryx open source project provides infrastructure for lambda-architecture applications on top of Spark, Spark Streaming and Kafka. On top of this, it provides further support for real-time, large-scale machine learning, and end-to-end applications of this support for common machine learning use cases, like recommendations, clustering, classification and regression.

Spark - A simple expressive web framework for java

  •    Java

Spark is a micro framework for creating web applications in Kotlin and Java 8 with minimal effort. It is a simple and expressive Java/Kotlin web framework DSL built for rapid development. Spark's intention is to provide an alternative for Kotlin/Java developers who want to develop their web applications as expressively as possible and with minimal boilerplate. With a clear philosophy, Spark is designed not only to make you more productive, but also to make your code better under the influence of Spark's sleek, declarative and expressive syntax.
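
The canonical hello-world is a single route definition. The framework itself targets Java 8 and Kotlin; it is called from Scala below only for consistency with the other sketches on this page:

    import spark.Spark.get
    import spark.{Request, Response, Route}

    object HelloWorld {
      def main(args: Array[String]): Unit = {
        // Starts the embedded Jetty server (default port 4567) and maps GET /hello
        get("/hello", new Route {
          override def handle(req: Request, res: Response): AnyRef = "Hello World"
        })
      }
    }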

HiBench - HiBench is a big data benchmark suite.

  •    Java

HiBench is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, Sleep, SQL, PageRank, Nutch indexing, Bayes, Kmeans, NWeight and enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Flink, Storm and Gearpump. There are 19 workloads in HiBench in total, divided into six categories: micro, ml (machine learning), sql, graph, websearch and streaming.

SparkInternals - Notes talking about the design and implementation of Apache Spark

This series discusses the design and implementation of Apache Spark, with a focus on its design principles, execution mechanisms, system architecture and performance optimization. In addition, there are some comparisons with Hadoop MapReduce in terms of design and implementation. I'm reluctant to call this document a "code walkthrough", because the goal is not to analyze each piece of code in the project, but to understand the whole system in a systematic way (through analyzing the execution procedure of a Spark job, from its creation to completion). There are many ways to discuss a computer system; here, we've chosen a problem-driven approach. First, a concrete problem is introduced, then it is analyzed step by step. We'll start from a typical Spark example job and then discuss all the related important system modules. I believe that this approach is better than diving into each module right from the beginning.

LearningSpark - Scala examples for learning to use Spark

  •    Scala

This project contains snippets of Scala code for illustrating various Apache Spark concepts. It is intended to help you get started with learning Apache Spark (as a Scala programmer) by providing a super easy on-ramp that doesn't involve Unix, cluster configuration, building from source or installing Hadoop. Many of these activities will be necessary later in your learning experience, after you've used these examples to achieve basic familiarity. It is intended to accompany a number of posts on the blog A River of Bytes.
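
In that spirit, a self-contained local-mode example needs nothing but a Spark dependency on the classpath. This is a minimal sketch in the same style, not taken from the repository itself:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local mode: no cluster, no Hadoop install; Spark runs inside this JVM
        val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[4]"))
        val counts = sc.parallelize(Seq("spark is fast", "spark is simple"))
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println)
        sc.stop()
      }
    }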

MMLSpark - Microsoft Machine Learning for Apache Spark

  •    Scala

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.