BioAlignments.jl - Sequence alignment tools


BioAlignments provides alignment algorithms, data structures, and I/O tools for SAM and BAM file formats. If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.



Related Projects

bwa - Burrows-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)

  •    C

Note: minimap2 has replaced BWA-MEM for PacBio and Nanopore read alignment. It retains all major BWA-MEM features, but is ~50 times as fast, more versatile, more accurate and produces better base-level alignment. BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features, such as support for long reads and chimeric alignment, but BWA-MEM, the most recent of the three, is generally recommended because it is faster and more accurate. BWA-MEM also performs better than BWA-backtrack for 70-100bp Illumina reads.

cutadapt - cutadapt removes adapter sequences from sequencing reads

  •    Python

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads. Cleaning your data in this way is often required. Reads from small-RNA sequencing contain the 3’ sequencing adapter because the read is longer than the molecule that is sequenced. Amplicon reads start with a primer sequence. Poly-A tails are useful for pulling out RNA from your sample, but often you don’t want them in your reads.
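The 3’ adapter case described above can be illustrated with a minimal Python sketch. This is not Cutadapt's implementation: Cutadapt uses error-tolerant alignment, while this sketch assumes exact matches only. The function name trim_3prime_adapter and the min_overlap parameter are hypothetical names chosen for illustration.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter from a read, exact matches only.

    First looks for the full adapter anywhere in the read. If that fails,
    looks for a prefix of the adapter (at least min_overlap bases long)
    at the very end of the read, which happens when the read stops
    partway through the adapter.
    """
    pos = read.find(adapter)
    if pos != -1:
        # Full adapter found: keep only the bases before it.
        return read[:pos]
    # Try the longest possible partial adapter first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter detected: return the read unchanged.
    return read
```

For example, trim_3prime_adapter("ACGTACGTAGATC", "AGATCGGAAGAGC") removes the trailing partial adapter and keeps "ACGTACGT". Short overlaps can match by chance, which is why the sketch requires a minimum overlap.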


  •    Java

NeoBio is a Java class library of Computational Biology Algorithms. The current version consists mainly of pairwise sequence alignment algorithms such as the classical dynamic programming methods of Needleman-Wunsch and Smith-Waterman.
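As a rough illustration of the two classical methods named above (a Python sketch, not NeoBio's Java API), the function below computes only the optimal score, not the alignment itself. It assumes a linear gap penalty; the function name and the scoring values are illustrative choices.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the optimal pairwise alignment score with a linear gap penalty.

    local=False computes a Needleman-Wunsch (global) score.
    local=True computes a Smith-Waterman (local) score.
    """
    # prev holds row i-1 of the dynamic programming matrix.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        # First column: aligning a[:i] against an empty prefix of b.
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]
```

The only differences between the two modes are the initialization of the first row and column, the clamp at zero, and where the final score is read from: the bottom-right cell for global alignment, the maximum cell for local alignment.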

DNA Sequence Annotation Studio


Sequence Annotation Studio can be used to perform the following: 1. View the complete sequence with zoom in / zoom out facility. 2. View the annotations (all or selectively). 3. Create new annotation and edit or delete existing ones. 4. Save modified sequence in GenBank ...

sambamba - Tools for working with SAM/BAM/CRAM data

  •    D

Sambamba is a high-performance, highly parallel, robust and fast tool (and library), written in the D programming language, for working with SAM and BAM files. Because of its efficiency, it is an important workhorse running in many sequencing centres around the world today. Current functionality is an important subset of samtools functionality, including view, index, sort, markdup, and depth. Most tools support piping: just specify /dev/stdin or /dev/stdout as filenames. When we started writing sambamba (in 2012), the main advantage over samtools was parallelized BAM reading and writing. In March 2017, samtools 1.4 was released, reaching parity on this. A recent performance comparison shows that sambamba holds its ground and can do better in different configurations. Here are some comparison metrics: for flagstat, sambamba is 1.4x faster than samtools; for index, they are similar; for markdup, sambamba is almost 6x faster; and for view, 4x faster. For sort, sambamba is generally slower, though it is up to 2x faster on machines with large amounts of RAM.



Multi-functional batch sequence aligner incorporating Needleman-Wunsch, Smith-Waterman and Oommen-Kashyap algorithms along with compound alignment of secondary sequences.

picard - A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF

  •    Java

A set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats. Picard is implemented using the HTSJDK Java library to support accessing file formats that are commonly used for high-throughput sequencing data, such as SAM and VCF.

SAM tools

  •    C

SAM (Sequence Alignment/Map) is a flexible generic format for storing nucleotide sequence alignments. SAM tools provide efficient utilities for manipulating alignments in the SAM format.
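To make the format concrete, here is a minimal Python sketch that reads the 11 mandatory tab-separated fields of a SAM alignment line and derives two common quantities from them. It does not use SAM tools; the names SamRecord, parse_sam_line, reference_span and is_reverse are hypothetical, and optional tag fields are ignored.

```python
import re
from collections import namedtuple

# The 11 mandatory SAM fields, in file order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)

# A CIGAR string is a run of <length><operation> pairs, e.g. 5M2I3M.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


def reference_span(cigar):
    """Number of reference bases covered; M, D, N, = and X consume the reference."""
    if cigar == "*":
        return 0
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


def is_reverse(record):
    """True if FLAG bit 0x10 (read mapped to the reverse strand) is set."""
    return bool(record.flag & 0x10)
```

A record parsed from a line with POS 100 and CIGAR 5M2I3M1D4M covers 13 reference bases, so it spans positions 100 through 112 (POS is 1-based).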


  •    Java

GATA is a graphic alignment tool for comparative sequence analysis. It makes use of BLAST to graphically align two DNA sequences, creating box-line-box representations of window-scored local alignments. GATA also displays extensive GFF gene annotation.


  •    Java

JAligner is an open source Java implementation of the dynamic programming algorithm Smith-Waterman with Gotoh's improvement for biological local pairwise sequence alignment with the affine gap penalty model.
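The affine gap model mentioned above can be sketched in Python as follows. This is an illustration of the Smith-Waterman recurrence with Gotoh's extension, not JAligner's Java code. It returns only the best local score, and the scoring values are assumptions: a gap of length k is taken to cost gap_open + (k - 1) * gap_extend.

```python
def smith_waterman_gotoh(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score under an affine gap penalty."""
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # H: best score of an alignment ending at (i-1, j)
    e_prev = [neg_inf] * cols  # E: same, but ending with a gap that skips a base of a
    best = 0
    for i in range(1, len(a) + 1):
        h_curr = [0] * cols
        e_curr = [neg_inf] * cols
        f = neg_inf            # F: ending with a gap that skips a base of b
        for j in range(1, cols):
            # Either open a new gap from H or extend an existing one.
            e_curr[j] = max(h_prev[j] + gap_open, e_prev[j] + gap_extend)
            f = max(h_curr[j - 1] + gap_open, f + gap_extend)
            diag = h_prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h_curr[j] = max(0, diag, e_curr[j], f)
            best = max(best, h_curr[j])
        h_prev, e_prev = h_curr, e_curr
    return best
```

Tracking the gap states E and F separately is what keeps the running time at O(len(a) * len(b)): extending a gap only needs the previous gap state, not a scan over all possible gap lengths.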

htslib - C library for high-throughput sequencing data formats

  •    C

HTSlib is an implementation of a unified C library for accessing common file formats, such as SAM, CRAM and VCF, used for high-throughput sequencing data, and is the core library used by samtools and bcftools. HTSlib only depends on zlib. It is known to be compatible with gcc, g++ and clang. HTSlib implements a generalized BAM index, with file extension .csi (coordinate-sorted index). The HTSlib file reader first looks for the new index and then for the old if the new index is absent.

freebayes - Bayesian haplotype-based genetic polymorphism discovery and genotyping.

  •    C++

FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. FreeBayes uses short-read alignments (BAM files with Phred+33 encoded quality scores, now standard) for any number of individuals from a population and a reference genome (in FASTA format) to determine the most-likely combination of genotypes for the population at each position in the reference. It reports positions which it finds putatively polymorphic in variant call file (VCF) format. It can also use an input set of variants (VCF) as a source of prior information, and a copy number variant map (BED) to define non-uniform ploidy variation across the samples under analysis.
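The Phred+33 encoding mentioned above stores each base quality Q as the ASCII character with code Q + 33, where Q = -10 * log10(P) and P is the estimated probability that the base call is wrong. A small Python sketch of the decoding follows; the function names are illustrative and not part of FreeBayes.

```python
def phred33_scores(quality):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(c) - 33 for c in quality]


def phred33_error_probs(quality):
    """Convert each quality score Q into an error probability 10 ** (-Q / 10)."""
    return [10 ** (-q / 10) for q in phred33_scores(quality)]
```

Under this encoding the character "!" corresponds to Q = 0 (error probability 1) and "I" corresponds to Q = 40 (error probability 0.0001).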


  •    Java

FSA is a probabilistic multiple sequence alignment algorithm which uses a "distance-based" approach to aligning homologous protein, RNA or DNA sequences.

pysam - Pysam is a Python module for reading and manipulating SAM/BAM/VCF/BCF files

  •    C

Pysam is a Python module for reading and manipulating files in the SAM/BAM format. The SAM/BAM format is a way to efficiently store large numbers of alignments (Li 2009), such as those routinely created by next-generation sequencing methods. Pysam is a lightweight wrapper of the samtools C-API. Pysam also includes an interface for tabix.

Sequence Quality Control Studio (SeQCoS)


Sequence Quality Control Studio (SeQCoS) is an open source .NET software suite designed to perform quality control (QC) of massively parallel sequencing reads. It includes tools for evaluating sequence and base quality of reads, as well as a set of basic post-QC sequence manip...


  •    Perl

Software for storing and analysing bacterial sequence data

.NET Bio

  •    DotNet

.NET Bio is a language-neutral bioinformatics toolkit built using the Microsoft .NET Framework 4.5 to help developers, researchers, and scientists.

Sequences studio

  •    Java

The Sequence studio main package provides classes and interfaces for various kinds of sequence alignment. Unlike regular expressions, it computes the similarity between two initially unknown strings. The project page provides a code-generating applet.

bioawk - BWK awk modified for biological data

  •    C

Bioawk is an extension to Brian Kernighan's awk, adding support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command-line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk. The original awk requires a YACC-compatible parser generator (e.g. Byacc or Bison). Bioawk further depends on zlib so as to work with gzip'ed files.

deepvariant - DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data

  •    Python

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. It is a suite of Python/C++ programs that run on any Unix-like operating system. For convenience, the documentation refers to building and running DeepVariant on Google Cloud Platform, but the tools themselves can be built and run on any standard Linux computer, including on-premise machines. Note that DeepVariant currently requires Python 2.7 and does not yet work with Python 3.