
pachyderm - Reproducible Data Science at Scale!

  •    Go

Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you. Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.

ClickHouse - Columnar DBMS and Real Time Analytics

  •    C++

ClickHouse is an open source column-oriented database management system capable of real time generation of analytical data reports using SQL queries. It is linearly scalable, fast, highly reliable, and fault tolerant, and it offers data compression, real-time query processing, web analytics, vectorized query execution, and local and distributed joins. It can process hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second.




NakedTensor - Bare bone examples of machine learning in TensorFlow

  •    Python

This is a bare bones example of TensorFlow, a machine learning package published by Google. You will not find a simpler introduction to it. In each example, a straight line is fit to some data. Values for the slope and y-intercept of the line that best fit the data are determined using gradient descent. If you do not know about gradient descent, check out the Wikipedia page.
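As a rough illustration of the idea described above, the following sketch fits a straight line by gradient descent in plain Python. It is not taken from the NakedTensor repository and does not use TensorFlow; the function name `fit_line`, the learning rate, the step count, and the sample data are all assumptions chosen for the example.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = m*x + b by gradient descent on the mean squared error.

    Hypothetical helper for illustration only; the hyperparameters are
    assumptions that happen to converge on the small data set below.
    """
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line against the data.
        errors = [m * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2).
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


# Sample points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

With noisy data the same loop applies unchanged; only the final slope and intercept become estimates rather than exact values.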

spaCy - 💫 Industrial-strength Natural Language Processing (NLP) with Python and Cython

  •    Python

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 20+ languages. It features the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration. It's commercial open-source software, released under the MIT license. 💫 Version 2.0 out now! Check out the new features here.

Stream-Framework - Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis

  •    Python

Stream Framework is a Python library which allows you to build activity streams & newsfeeds using Cassandra and/or Redis. If you're not using Python, have a look at Stream, which supports Node, Ruby, PHP, Python, Go, Scala, Java and REST. Stream Framework's authors also offer a web service for building scalable newsfeeds & activity streams at Stream. It allows you to create your feeds by talking to a beautiful and easy to use REST API. There are clients available for Node, Ruby, PHP, Python, Go, Scala and Java. The Get Started page explains the API & concept in a few clicks. It's a lot easier to use, free up to 3 million feed updates and saves you the hassle of maintaining Cassandra, Redis, Faye, RabbitMQ and Celery workers.

Hue - The open source Apache Hadoop UI

  •    Java

Hue is a Web application for interacting with Apache Hadoop. It provides a FileBrowser for accessing HDFS, a JobBrowser for accessing MapReduce jobs (MR1/MR2-YARN), a Job Designer for creating MapReduce/Streaming/Java jobs, an HBase Browser for exploring and modifying HBase tables and data, an Oozie App for submitting and scheduling workflows and bundles, a Pig/HBase/Sqoop2 shell, a Beeswax application for executing Hive queries, and a Search app for querying Solr and Solr Cloud.


Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster

  •    C

Postgres-XL is a horizontally scalable open source SQL database cluster, flexible enough to handle varying database workloads such as OLTP, Business Intelligence requiring MPP parallelism, key-value storage, GIS/geospatial, and a lot more.

TrailDB - Efficient tool for storing and querying series of events

  •    C

TrailDB is a library, implemented in C, which allows you to query series of events at blazing speed. TrailDB is also optimized for speed of development: Use its simple API with your favorite language, in your favorite environment. TrailDB's secret sauce is data compression. It leverages predictability of time-based data to compress your data to a fraction of its original size. In contrast to traditional compression, you can query the encoded data directly, decompressing only the parts you need.
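To give a feel for why predictable, time-based data compresses well, here is a minimal delta-encoding sketch in Python. This is not TrailDB's actual storage format or API; the function names `delta_encode` and `delta_decode` and the sample timestamps are assumptions made for illustration.

```python
def delta_encode(timestamps):
    """Store the first timestamp, then only the gap to each following one.

    Illustrative only: when events arrive at fairly regular intervals,
    the gaps are small numbers that need far fewer bits than full
    timestamps.
    """
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for previous, current in zip(timestamps, timestamps[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original timestamps by accumulating the gaps."""
    timestamps = []
    running_total = 0
    for delta in deltas:
        running_total += delta
        timestamps.append(running_total)
    return timestamps


# Hypothetical event times in seconds.
events = [1500000000, 1500000005, 1500000010, 1500000012]
encoded = delta_encode(events)
print(encoded)
print(delta_decode(encoded) == events)
```

A real system would go further, for example by entropy-coding the small gaps, but the round trip above shows the basic idea of exploiting regularity in event times.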

hpat - A compiler-based big data framework in Python

  •    Python

High Performance Analytics Toolkit (HPAT) scales analytics/ML codes in Python to bare-metal cluster/cloud performance automatically. It compiles a subset of Python (Pandas/Numpy) to efficient parallel binaries with MPI, requiring only minimal code changes. HPAT is orders of magnitude faster than alternatives like Apache Spark. HPAT's documentation can be found here.

conjure-up - Deploying complex solutions, magically.

  •    Python

Installing big software like whoa. This is the runtime application for processing spells to get those big software solutions up and going with as little hindrance as possible.

clusterdock - clusterdock is a framework for creating Docker-based container clusters

  •    Python

clusterdock is a Python 3 project that enables users to build, start, and manage Docker container-based clusters. It uses a pluggable system for defining new types of clusters using folders called topologies and is a swell project, if I may say so myself.

acousticbrainz-server - The server components for the AcousticBrainz project

  •    Python

The server components for the AcousticBrainz project. Full installation instructions are available in the INSTALL.md file. After installing, continue with the following steps.

listenbrainz-server - Server for the ListenBrainz project

  •    Python

The ListenBrainz project is similar to the original AudioScrobbler®. Unlike the original project, ListenBrainz is open source and publishes its data as open data. A team of former Last.fm and current MusicBrainz hackers created the first version of ListenBrainz in a weekend. Since the original project was created, technology has advanced at an incredibly rapid pace, which made re-creating the original project fairly straightforward.

dvid - Distributed, Versioned, Image-oriented Dataservice

  •    Go

Status: In production use at Janelia. See wiki page for outside lab use of DVID. See the DVID Wiki for more information including installation and examples of use.

hazelcast-go-client - Hazelcast IMDG Go Client

  •    Go

Go client implementation for Hazelcast, the open source in-memory data grid. Go client is implemented using the Hazelcast Open Binary Client Protocol.

hazelcast-python-client - Hazelcast IMDG Python Client

  •    Python

Python client implementation for Hazelcast, the open source in-memory data grid. Please take a look at our Getting Started guide.

aws-etl-orchestrator - A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda

  •    Python

Extract, transform, and load (ETL) operations collectively form the backbone of any modern enterprise data lake. ETL transforms raw data into useful datasets and, ultimately, into actionable insight. An ETL job typically reads data from one or more data sources, applies various transformations to the data, and then writes the results to a target where data is ready for consumption. The sources and targets of an ETL job could be relational databases in Amazon Relational Database Service (Amazon RDS) or on-premises, a data warehouse such as Amazon Redshift, or object storage such as Amazon Simple Storage Service (Amazon S3) buckets. Amazon S3 as a target is especially commonplace in the context of building a data lake in AWS. AWS offers AWS Glue, a fully managed extract, transform, and load service that helps customers author and deploy ETL jobs and makes it easy to prepare and load their data for analytics. Other AWS services can also be used to implement and manage ETL jobs. They include AWS Database Migration Service (AWS DMS), Amazon EMR (using the Steps API), and even Amazon Athena.
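The read, transform, write shape of an ETL job described above can be sketched in a few lines of Python. This is a generic in-memory illustration, not code from aws-etl-orchestrator and not tied to any AWS service; the function names, the CSV columns, and the `warehouse` list standing in for a target are all hypothetical.

```python
import csv
import io


def extract(csv_text):
    """Read rows from a source; here the source is a CSV string."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Drop rows with a missing amount and normalize the rest."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append({
            "customer": row["customer"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return cleaned


def load(rows, target):
    """Write rows to a target; here the target is a plain list."""
    target.extend(rows)
    return len(rows)


# Hypothetical raw input with untidy names and one missing amount.
raw = "customer,amount\n Alice ,10.5\nBob,\nCAROL,3\n"
warehouse = []
loaded = load(transform(extract(raw)), warehouse)
print(loaded, warehouse)
```

In an orchestrated workflow each of the three functions would typically be a separate job or step, so that a failure in one stage can be retried without rerunning the others.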