annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

  •    C++

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data. To install, run sudo pip install annoy to pull down the latest version from PyPI.
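The idea of a read-only, file-based structure that is memory-mapped can be sketched with the Python standard library alone. This is not Annoy's index format or its search algorithm: the flat float32 layout, the names DIM, write_vectors and nearest, and the brute-force scan are all assumptions made for illustration. It only shows how a file written once can be mapped read-only and searched without loading it into a private copy.

```python
import mmap
import os
import struct
import tempfile

DIM = 3  # assumed vector dimensionality for this sketch
RECORD = struct.Struct(f"<{DIM}f")  # one little-endian float32 vector per record


def write_vectors(path, vectors):
    """Write the vectors back to back into a flat binary file."""
    with open(path, "wb") as f:
        for v in vectors:
            f.write(RECORD.pack(*v))


def nearest(path, query):
    """Return the index of the stored vector closest to query.

    The file is mapped read-only, so several processes can map the same
    file and share the pages through the operating system's page cache.
    The scan itself is exact brute force, not an approximate search.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        best_index, best_dist = -1, float("inf")
        for i in range(len(mm) // RECORD.size):
            v = RECORD.unpack_from(mm, i * RECORD.size)
            dist = sum((a - b) ** 2 for a, b in zip(v, query))
            if dist < best_dist:
                best_index, best_dist = i, dist
        return best_index


path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)])
result = nearest(path, (0.9, 1.1, 1.0))
```

Here result is the position of the stored vector nearest to the query. A real index would replace the linear scan with a tree or graph structure stored in the same mapped file.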

Non-Metric Space Library (NMSLIB) - An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

  •    C++

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core library does not have any third-party dependencies. It has been gaining popularity; in particular, it has become part of Amazon Elasticsearch Service.

tlsh - TLSH lib in Golang

  •    Go

TLSH is a fuzzy matching library. Given a byte stream with a minimum length of 256 bytes, TLSH generates a hash value which can be used for similarity comparisons. Similar objects will have similar hash values, which allows similar objects to be detected by comparing their hash values. Note that the byte stream should have a sufficient amount of complexity. For example, a byte stream of identical bytes will not generate a hash value. The computed hash is 35 bytes long (output as 70 hexadecimal characters). The first 3 bytes capture information about the file as a whole (length, ...), while the last 32 bytes capture information about incremental parts of the file.
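The layout described above (35 bytes, a 3-byte header followed by a 32-byte body, printed as 70 hexadecimal characters) can be made concrete with a small sketch. It does not compute a TLSH hash and does not use the Go library's API; the names can_hash and split_digest are hypothetical, the digest string is a made-up placeholder rather than a real TLSH value, and the pre-check is only a rough necessary condition, since the actual complexity requirement is stricter than "not all bytes identical".

```python
MIN_INPUT_LEN = 256   # minimum byte-stream length stated above
DIGEST_HEX_LEN = 70   # 35 bytes rendered as hexadecimal
HEADER_BYTES = 3      # information about the input as a whole (length, ...)


def can_hash(data: bytes) -> bool:
    """Rough pre-check: long enough and not a single repeated byte value."""
    return len(data) >= MIN_INPUT_LEN and len(set(data)) > 1


def split_digest(hex_digest: str):
    """Split a 70-character hex digest into (header, body) byte strings."""
    if len(hex_digest) != DIGEST_HEX_LEN:
        raise ValueError(f"expected {DIGEST_HEX_LEN} hex characters, got {len(hex_digest)}")
    raw = bytes.fromhex(hex_digest)
    return raw[:HEADER_BYTES], raw[HEADER_BYTES:]


# Placeholder digest for illustration only, not the hash of any real input.
placeholder = "0a1b2c" + "ff" * 32
header, body = split_digest(placeholder)

too_short = can_hash(b"abc")                   # shorter than 256 bytes
all_same = can_hash(b"\x00" * 512)             # long enough, but identical bytes
plausible = can_hash(bytes(range(256)) * 2)    # long and varied
```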

soundfingerprinting - The project aims to study the audio signal in terms of its perceptual characteristics, resulting in an algorithm that will be able to detect (map) unknown audio snippets from a large database of known songs

  •    CSharp

soundfingerprinting is a C# framework designed for developers, enthusiasts, and researchers in the fields of audio and digital signal processing, data mining, and audio recognition. It implements an efficient algorithm which provides fast insertion and retrieval of acoustic fingerprints with high precision and recall. Acoustic fingerprints are extracted from an audio file and later used as identifiers to recognize an unknown audio query. These sub-fingerprints (or fingerprints; the two terms are used interchangeably) are stored in a configurable backend. The interfaces for fingerprinting and querying audio files are implemented as Fluent Interfaces.

spamsum - A native Go implementation of spamsum

  •    Go

This is a native Go implementation of spamsum. spamsum was developed by Andrew Tridgell to hash email messages for computationally inexpensive spam detection. See http://junkcode.samba.org/#spamsum.

ExpressionMatrix2 - Software for exploration of gene expression data from single-cell RNA sequencing

  •    C++

This repository contains software for analysis, visualization, and clustering of gene expression data from single-cell RNA sequencing, developed at the Chan Zuckerberg Initiative. It scales favorably to large numbers of cells thanks to its use of Locality-Sensitive Hashing (LSH), and was successfully used, without downsampling, on a data set with over one million cells. Documentation for the latest version of this software is available online through GitHub Pages, or you can use the directions below to obtain documentation for any previous release.
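To give a feel for why LSH helps at this scale, here is a minimal sketch of one common LSH family, random-hyperplane signatures for cosine similarity. The source does not say which LSH scheme ExpressionMatrix2 uses, so the choice of scheme, the function names, the toy expression vectors, and the 16-bit signature length are all assumptions. The point is only that each cell is reduced to a short bit signature, and similar cells can then be found by comparing signatures instead of full expression vectors.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * x_i for h_i, x_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    )


def hamming(a, b):
    """Number of differing bits; a small distance suggests a small angle."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, n_bits=16)

# Toy expression vectors (hypothetical counts for four genes).
cell_a = [5.0, 0.0, 1.0, 3.0]
cell_b = [10.0, 0.0, 2.0, 6.0]   # same profile as cell_a at twice the depth
cell_c = [0.0, 8.0, 0.0, 1.0]    # a different expression profile

sig_a = signature(cell_a, planes)
sig_b = signature(cell_b, planes)
sig_c = signature(cell_c, planes)

same_direction_distance = hamming(sig_a, sig_b)
different_profile_distance = hamming(sig_a, sig_c)
```

Because cell_b is a positive multiple of cell_a, every dot product keeps its sign and the two signatures are identical. The distance to cell_c depends on the random hyperplanes drawn, so no particular value is claimed for it.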