Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. To run programs faster, Spark offers a general execution model that can optimize arbitrary operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
http://spark.incubator.apache.org/
Tags | cluster cluster-computing data-analytics analytics hdfs map-reduce big-data |
Implementation | Scala |
License | Apache |
Platform | OS-Independent |
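To make the execution model concrete, here is a minimal PySpark sketch of in-memory caching; the application name and the events.log input path are illustrative assumptions, not taken from the description above.

    # Minimal PySpark sketch: build an operator graph lazily, then cache
    # the result in memory so repeated queries avoid re-reading from disk.
    # The input path "events.log" is a placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-caching-demo").getOrCreate()

    lines = spark.read.text("events.log")
    errors = lines.filter(lines.value.contains("ERROR")).cache()

    print(errors.count())  # first action materializes and caches the data
    print(errors.count())  # second action is served from memory
    spark.stop()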
High Performance Analytics Toolkit (HPAT) automatically scales analytics/ML code in Python to bare-metal cluster/cloud performance. It compiles a subset of Python (Pandas/NumPy) to efficient parallel binaries with MPI, requiring only minimal code changes, and is orders of magnitude faster than alternatives such as Apache Spark.
big-data parallel-computing compilers machine-learning numpy pandas
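The sketch below, modeled on HPAT's published pi example, shows the intended workflow; the @hpat.jit decorator is the only HPAT-specific piece and is an assumption based on the project's examples.

    # HPAT-style usage, modeled on the project's pi example; @hpat.jit
    # compiles the ordinary NumPy code below to a parallel MPI binary.
    import numpy as np
    import hpat

    @hpat.jit
    def calc_pi(n):
        # Monte Carlo estimate of pi; no explicit distribution logic needed.
        x = 2 * np.random.ranf(n) - 1
        y = 2 * np.random.ranf(n) - 1
        return 4 * np.sum(x * x + y * y < 1) / n

    print(calc_pi(10_000_000))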
The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises an extensible higher-level language (Meteor) for quickly composing queries for common and recurring use cases, a parallel programming model (PACT, an extension of MapReduce) for running user-defined operations, and an efficient massively parallel runtime (Nephele) for fault-tolerant execution of acyclic data flows.
cloud-framework cloud big-data parallel information-management
Apache Spark is a general-purpose parallel computational engine for analytics at scale. At its core it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, it can also be quite inefficient and expensive: analytic processing requires massive data sets to be repeatedly copied and reformatted to suit Spark, and in many cases it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, the entire table must be streamed into Spark to do the aggregation; caching within Spark is immutable and results in stale insight. SnappyData takes a very different approach: it fuses a low-latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out in the same columnar format as Spark (Tungsten), and all query engine operators are significantly more optimized through better vectorization and code generation. The net effect is an order-of-magnitude performance improvement over native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.
snappydata spark memory-database analytics stream transaction scale
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. Hadoop Common supports the other Hadoop subprojects.
cluster map-reduce distributed-file-system
Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques such as efficient data layout, code generation and query optimization to optimize geospatial queries. The application developer writes standard SQL or DataFrame queries to evaluate geometric expressions, while the execution engine takes care of efficiently laying data out in memory during query processing, picking the right query plan, and optimizing query execution with cheap and efficient spatial indices, all while presenting a declarative abstraction to the developer.
geospatial-analytics sparksql spark geometric-algorithms geojson shapefile geospatial geospatial-processing geospatial-analysis big-data magellan
Apache Mnemonic is a non-volatile, hybrid-memory-oriented storage library. It provides a non-volatile/durable Java object model and durable computing services that significantly improve the performance of massive real-time data processing/analytics. Developers can use the library to design cache-less and SerDe-less high-performance applications.
memory non-volatile-memory cache
Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.
cluster-management resource-manager big-data data-processing
GeoMesa is an open-source, distributed, spatio-temporal database built on a number of distributed cloud data storage systems, including Accumulo, HBase, Cassandra, and Kafka. Leveraging a highly parallelized indexing strategy, GeoMesa aims to provide as much of the spatial querying and data manipulation to Accumulo as PostGIS does to Postgres.
database geospatial-database spatial-database geospatial geospatial-analytics maps
This Docker image adds high-performance graph analytics to a Neo4j graph database. It deploys a container with Apache Spark and uses GraphX to perform ETL graph analysis on subgraphs exported from Neo4j. The Neo4j Mazerunner service in this image is an unmanaged extension that adds a REST API endpoint to Neo4j for submitting graph analysis jobs to Apache Spark GraphX. The results of the analysis are applied back to the nodes in Neo4j as property values, making them queryable using Cypher.
Disco is a distributed map-reduce and big-data framework. Like the original framework publicized by Google, Disco supports parallel computations over large data sets on an unreliable cluster of computers. This makes it a perfect tool for analyzing and processing large datasets without having to worry about difficult technical questions related to distributed computing, such as communication protocols, load balancing, locking, job scheduling or fault tolerance, all of which are taken care of by Disco. Note: to install Disco, you cannot use the zip or tar.gz packages generated by GitHub; instead, you should clone this repository.
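For flavor, here is the classic word-count job in the style of Disco's tutorial; the input URL is the sample text the tutorial uses.

    # Word count in the style of Disco's tutorial example.
    from disco.core import Job, result_iterator

    def map(line, params):
        # Emit a (word, 1) pair for every word in an input line.
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        # Group the sorted pairs by word and sum the counts.
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                        map=map, reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print(word, count)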
Alluxio (formerly known as Tachyon) is a virtual distributed storage system. It bridges the gap between computation frameworks and storage systems, enabling computation applications to connect to numerous storage systems through a common interface.
distributed-storage big-data memory-speed hadoop spark virtual-file-system presto tensorflow storage object-store
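As an illustration of that common interface, a Spark job can read through Alluxio simply by addressing an alluxio:// URI; the master host, port, and path below are placeholders.

    # Reading through Alluxio from PySpark; host, port, and path are
    # placeholders. The same DataFrame API works whether the bytes live
    # in HDFS, S3, or another under-store mounted into Alluxio.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-demo").getOrCreate()
    df = spark.read.text("alluxio://alluxio-master:19998/data/sample.txt")
    print(df.count())
    spark.stop()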
Like my work? I am Principal Consultant at Data Syndrome, a consultancy offering assistance and training with building full-stack analytics products, applications and systems. Find us on the web at datasyndrome.com. There is now a video course using code from chapter 8, Realtime Predictive Analytics with Kafka, PySpark, Spark MLlib and Spark Streaming. Check it out now at datasyndrome.com/video.
data-syndrome data data-science analytics apache-spark apache-kafka kafka spark predictive-analytics machine-learning machine-learning-algorithms airflow python-3 python3 amazon-ec2 agile-data agile-data-science vagrant amazon-web-services
Snowplow is an enterprise-strength marketing and product analytics platform. It identifies your users and tracks the way they engage with your website or application. It stores your users' behavioural data in a scalable "event data warehouse" you control: in Amazon S3 and (optionally) Amazon Redshift or Postgres. It lets you leverage the widest range of tools to analyze that behavioural data, including big data tools (e.g. Spark) via EMR or more traditional tools (e.g. Looker, Mode, Superset, Re:dash).
analytics event-analytics cloud kafka aws
Hoodie is an Apache Spark library that provides the ability to efficiently do incremental processing on datasets in HDFS.
hadoop spark parquet analytics-database ingestion hoodie hudi columnar storage
Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources.
data-warehouse etl aggregation analytics sql-on-hadoop
TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the deep learning framework TensorFlow with the big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on the Hadoop clusters in Yahoo's private cloud.
tensorflow spark yahoo machine-learning cluster featured
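The sketch below shows the general shape of a TensorFlowOnSpark driver, modeled on the project's MNIST example; the map function body, cluster sizes, and input path are placeholders, and the TFCluster calls should be checked against the version in use.

    # Shape of a TensorFlowOnSpark driver, modeled on the project's MNIST
    # example; main_fun's body and the data RDD are placeholders.
    from pyspark import SparkContext
    from tensorflowonspark import TFCluster

    def main_fun(argv, ctx):
        # Runs on every executor; ctx identifies this node's role
        # ("ps" or "worker") and its task index in the TF cluster.
        import tensorflow as tf
        print("node:", ctx.job_name, ctx.task_index)
        # ... build and train a TensorFlow graph here ...

    sc = SparkContext(appName="tfos-demo")
    data_rdd = sc.textFile("hdfs:///data/train.csv")  # placeholder input

    # 4 executors total, 1 parameter server; feed data through Spark RDDs.
    cluster = TFCluster.run(sc, main_fun, None, 4, 1, False,
                            TFCluster.InputMode.SPARK)
    cluster.train(data_rdd, num_epochs=1)
    cluster.shutdown()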
Derive useful insights from your data using Python. Learn the techniques of natural language processing and text analytics, and gain the skills to know which technique is best suited to solve a particular problem. The book follows a structured and comprehensive approach so that readers with little or no experience do not find themselves overwhelmed. You will start with the basics of natural language and Python and move on to advanced analytical and machine learning concepts. You will look at each technique and algorithm both from a bird's-eye view, to understand how it can be used, and from a microscopic view, to understand the mathematical concepts and implement them to solve your own problems.
text-analytics text-summarization text-classification natural-language natural-language-processing clustering sentiment semantic sentiment-analysis nltk stanford-nlp spacy pattern scikit-learn gensim
Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform.
security-framework cyber-crime anomoly-detection monitoring security big-data threat-intelligence security-analytics opensoc siem threat-analytics
Events for a given persistenceId are partitioned across nodes in the Cassandra cluster, where the partition is represented by the partition field in the key. The eventTable() method returns an RDD in which events with the same persistenceId-partition combination (= cluster partition) are ordered by increasing sequenceNr, but the ordering across cluster partitions is not defined. If needed, the RDD can be sorted with sortByKey() by persistenceId, partition and sequenceNr, in that order of significance. By the way, the default size of a cluster partition in the Cassandra journal is 5000000 events (see akka-persistence-cassandra). The stream of events (written by all persistent actors) is partially ordered, i.e. events with the same persistenceId are ordered by sequenceNr whereas the ordering of events with different persistenceId is not defined.
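The library's API is Scala, but the sort order described above can be illustrated generically; in the PySpark sketch below, the composite keys (persistenceId, partition, sequenceNr) and events are hypothetical, not the library's API.

    # Generic illustration of the ordering described above. sortByKey
    # compares key tuples lexicographically, which matches the stated
    # order of significance: persistenceId, then partition, then sequenceNr.
    from pyspark import SparkContext

    sc = SparkContext(appName="event-ordering-demo")
    events = sc.parallelize([
        (("actor-2", 0, 1), "event-3"),
        (("actor-1", 0, 2), "event-2"),
        (("actor-1", 0, 1), "event-1"),
    ])

    for key, event in events.sortByKey().collect():
        print(key, event)
    sc.stop()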