hail - Scalable genomic data analysis.

Hail is an open-source, scalable framework for exploring and analyzing genomic data. The Hail project began in Fall 2015 to empower the worldwide genetics community to harness the flood of genomes to discover the biology of human disease. Since then, Hail has expanded to enable analysis of large-scale datasets beyond the field of genomics.




Related Projects

Mail Hail


Mail Hail is a mail notification agent for the Jabber instant messaging server. Mail Hail includes Hailbox, an experimental local mail delivery agent.

nucleus - Python and C++ code for reading and writing genomics data.

  •    Python

Nucleus is a library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats like SAM and VCF. In addition, Nucleus enables painless integration with the TensorFlow machine learning framework, as anywhere a genomics file is consumed or produced, a TensorFlow tfrecords file may be substituted. For all other systems, you will need to first install CLIF by following the instructions at https://github.com/google/clif#installation before running install.sh.

bionode - Modular and universal bioinformatics

  •    Javascript

To use bionode as a command line tool, you can install it globally with -g. Or, if you want to use it as a JavaScript library, you need to install it in your local project folder inside the node_modules directory by doing the same command without -g.

gatk - Official code repository for GATK versions 4 and up

  •    Java

Please see the GATK website, where you can download a precompiled executable, read documentation, ask questions, and receive technical support. This repository contains the next generation of the Genome Analysis Toolkit (GATK). The contents of this repository are 100% open source and released under the BSD 3-Clause license (see LICENSE.TXT).

gemini - a lightweight db framework for exploring genetic variation.

  •    Python

The intent of GEMINI (GEnome MINIng) is to provide a simple, flexible, and powerful framework for exploring genetic variation for personal and medical genetics. GEMINI is unique in that it integrates genetic variation (from VCF files) with a wealth of genome annotations into a unified database framework. Using this integrated database as the analysis framework, we aim to leverage the expressive power of SQL for data analysis, while attempting to overcome the fundamental challenges associated with using databases for very large datasets (e.g., 1,000,000 variants times 1,000 samples yields one billion genotypes). In addition, by defining sample relationships with a PED file, GEMINI allows one to explore and test for variants that meet specific inheritance models (e.g., recessive, dominant, etc.). A high-level talk from SciPy 2013 describes GEMINI.
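To make the idea of querying variants with SQL concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, gene labels, and rows are all hypothetical and chosen for illustration only; this is not GEMINI's actual schema or API. The last line reproduces the genotype-count arithmetic quoted above.

```python
import sqlite3

# Hypothetical toy schema: NOT GEMINI's real database layout.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gene TEXT, impact TEXT)"
)

# Made-up illustrative rows (placeholder gene names, not real data).
rows = [
    ("chr1", 1000, "A", "G", "GENE_A", "missense"),
    ("chr1", 1500, "C", "T", "GENE_A", "synonymous"),
    ("chr1", 1800, "G", "A", "GENE_A", "missense"),
    ("chr2", 500, "T", "C", "GENE_B", "missense"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)

# Count missense variants per gene with a plain SQL aggregate.
missense_per_gene = conn.execute(
    "SELECT gene, COUNT(*) FROM variants "
    "WHERE impact = 'missense' GROUP BY gene ORDER BY gene"
).fetchall()

# Scale of the genotype matrix mentioned in the description:
# one genotype per (variant, sample) pair.
n_genotypes = 1_000_000 * 1_000
```

Once annotations and variants live in the same table, filters such as "missense variants per gene" become one-line queries, which is the appeal of the database approach. The genotype count shows why storage layout matters at that scale.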

bedtools2 - A powerful toolset for genome arithmetic.

  •    C++

Collectively, the bedtools utilities are a Swiss-army knife of tools for a wide range of genomics analysis tasks. The most widely used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
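As a rough illustration of what "set theory on the genome" means, the sketch below implements naive intersect and merge operations on intervals in pure Python. It is not how bedtools is implemented and does not reproduce its options or output format. The function names and the tuple layout (chromosome, start, end, with 0-based half-open coordinates as in BED) are assumptions made for this example.

```python
from typing import List, Tuple

# (chromosome, start, end), 0-based and half-open as in BED files.
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion of every pair of intervals from a and b.

    Naive O(len(a) * len(b)) scan, fine for illustration only.
    """
    out: List[Interval] = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # non-empty overlap
                out.append((chrom_a, start, end))
    return out


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


a = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 10, 20)]
b = [("chr1", 180, 250)]
overlaps = intersect(a, b)
collapsed = merge(a)
```

Chaining such primitives, for example merging first and intersecting afterwards, mirrors the way simple bedtools operations are combined on the command line.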

bioawk - BWK awk modified for biological data

  •    C

Bioawk is an extension to Brian Kernighan's awk, adding support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command-line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk. The original awk requires a YACC-compatible parser generator (e.g. Byacc or Bison). Bioawk further depends on zlib so as to work with gzip'ed files.

deepvariant - DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data

  •    Python

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. DeepVariant is a suite of Python/C++ programs that run on any Unix-like operating system. For convenience the documentation refers to building and running DeepVariant on Google Cloud Platform, but the tools themselves can be built and run on any standard Linux computer, including on-premise machines. Note that DeepVariant currently requires Python 2.7 and does not yet work with Python 3.



SysGenSIM is a bioinformatics toolbox to create artificial gene expression datasets by simulating Systems Genetics experiments.

jbrowse - A modern genome browser built with JavaScript and HTML5.

  •    Javascript

To install jbrowse, most users should visit http://jbrowse.org/install and download a zip file such as JBrowse-1.13.0.zip. See instructions at https://jbrowse.org/code/latest-release/docs/tutorial/ for a tutorial on setting up a sample instance. Once you have an instance up and running, http://gmod.org/wiki/JBrowse_Configuration_Guide is the comprehensive reference guide to JBrowse configuration.

vcflib - a simple C++ library for parsing and manipulating VCF files, + many command-line utilities

  •    C++

The Variant Call Format (VCF) is a flat-file, tab-delimited textual format intended to concisely describe reference-indexed variations between individuals. VCF provides a common interchange format for the description of variation in individuals and populations of samples, and has become the de facto standard reporting format for a wide array of genomic variant detectors. The API itself provides a quick and extremely permissive method to read and write VCF files. Extensions and applications of the library provided in the included utilities (*.cpp) comprise the vast bulk of the library's utility for most users.
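To show what "flat-file, tab-delimited" looks like in practice, the following sketch parses the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) of a single data line in pure Python. It is an illustration only and does not use or mirror the vcflib API. The `VcfRecord` type and `parse_vcf_line` function are hypothetical names, and the sketch ignores header lines and per-sample genotype columns.

```python
from typing import Dict, List, NamedTuple, Optional


class VcfRecord(NamedTuple):
    """Hypothetical container for the eight fixed VCF columns."""
    chrom: str
    pos: int
    id: str
    ref: str
    alt: List[str]
    qual: Optional[float]
    filter: str
    info: Dict[str, str]


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse one tab-delimited VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    info_dict: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            # "KEY=VALUE" pairs; bare flags such as "DB" get an empty value.
            key, _, value = entry.partition("=")
            info_dict[key] = value

    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        id=vid,
        ref=ref,
        alt=alt.split(","),  # multiple alternate alleles are comma-separated
        qual=None if qual == "." else float(qual),  # "." means missing
        filter=flt,
        info=info_dict,
    )


record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2")
```

A production parser would also validate the header, handle the FORMAT and sample columns, and type INFO values according to the header declarations.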

VCF Builder IDE

  •    C++

The VCF Builder is an advanced development tool for creating C++ applications, and supporting a wide number of plugins for enhancing its functionality. While the VCF Builder is capable of creating generic C++ applications, its forte is building GUI ap

snappydata - SnappyData - The Spark Database. Stream, Transact, Analyze, Predict in one cluster

  •    Scala

Apache Spark is a general-purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, this can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Spark to do the aggregation. Caching within Spark is immutable and results in stale insight. At SnappyData, we take a very different approach. SnappyData fuses a low-latency, highly available in-memory transactional database (GemFireXD) into Spark with shared memory management and optimizations. Data in the highly available in-memory store is laid out using the same columnar format as Spark (Tungsten). All query engine operators are significantly more optimized through better vectorization and code generation. The net effect is an order-of-magnitude performance improvement when compared to native Spark caching, and more than two orders of magnitude better Spark performance when working with external data sources.

Complete Genomics Analysis Tools


The Complete Genomics Analysis Tools is an open source project to provide tools to simplify analysis of genomics data produced by Complete Genomics.

galaxy - Data intensive science for everyone.

  •    Python

You may wish to make changes from the default configuration. This can be done in the config/galaxy.ini file. Note that not all dependencies for the tools provided in the tool_conf.xml.sample are included. To install them please visit "Manage dependencies" in the admin interface.

spark-movie-lens - An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

  •    Jupyter

This Apache Spark tutorial will guide you step by step through using the MovieLens dataset to build a movie recommender using collaborative filtering with Spark's Alternating Least Squares implementation. It is organised in two parts. The first one is about getting and parsing movies and ratings data into Spark RDDs. The second is about building and using the recommender and persisting it for later use in our on-line recommender system. This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark by Anthony D. Joseph on edX, which has also been publicly available since 2014 at Spark Summit. Starting from there, I've made minor modifications to use a larger dataset, then added code to store and reload the model for later use, and finally a web service using Flask.

spark - .NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

  •    CSharp

.NET for Apache Spark provides high-performance APIs for using Apache Spark from C# and F#. With these .NET APIs, you can access the most popular DataFrame and SparkSQL aspects of Apache Spark, for working with structured data, and Spark Structured Streaming, for working with streaming data. .NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs that are common across .NET implementations. This means you can use .NET for Apache Spark anywhere you write .NET code, allowing you to reuse all the knowledge, skills, code, and libraries you already have as a .NET developer.

flint - A Time Series Library for Apache Spark

  •    Scala

The ability to analyze time series data at scale is critical for the success of finance and IoT applications based on Spark. Flint is Two Sigma's implementation of highly optimized time series operations in Spark. It performs truly parallel and rich analyses on time series data by taking advantage of the natural ordering in time series data to provide locality-based optimizations. Flint is an open source library for Spark based around the TimeSeriesRDD, a time series aware data structure, and a collection of time series utility and analysis functions that use TimeSeriesRDDs. Unlike DataFrame and Dataset, Flint's TimeSeriesRDDs can leverage the existing ordering properties of datasets at rest and the fact that almost all data manipulations and analysis over these datasets respect their temporal ordering properties. It differs from other time series efforts in Spark in its ability to efficiently compute across panel data or on large scale high frequency data.

spark-ec2 - Scripts used to setup a Spark cluster on EC2

  •    Python

Please note: spark-ec2 is no longer under active development and the project has been archived. All the existing code, PRs and issues are still accessible but are now read-only. If you're looking for a similar tool that is under active development, we recommend you take a look at Flintrock. spark-ec2 allows you to launch, manage and shut down Apache Spark [1] clusters on Amazon EC2. It automatically sets up Apache Spark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you've already signed up for an EC2 account on the Amazon Web Services site.