datasketch - MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++

  •    Python

datasketch gives you probabilistic data structures that can process and search very large amounts of data very quickly, with little loss of accuracy. datasketch requires Python 2.7 or above and NumPy 1.11 or above. SciPy is optional, but with it LSH initialization can be much faster.
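To illustrate the trade-off described above (a small fixed-size sketch in exchange for a small loss of accuracy), here is a minimal pure-Python sketch of the MinHash idea. It is not datasketch's API or implementation: the function names `minhash_signature` and `estimate_jaccard` are hypothetical, and salted BLAKE2b hashes stand in for the random permutations.

```python
import hashlib


def minhash_signature(items, num_perm=128):
    """Return one minimum hash value per seeded hash function.

    `items` must be a non-empty iterable of strings.
    """
    items = list(items)
    signature = []
    for seed in range(num_perm):
        # A different salt per position simulates a different hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(
                    item.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


set_a = {"w%d" % i for i in range(0, 30)}
set_b = {"w%d" % i for i in range(10, 40)}
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
# The true Jaccard similarity is 20 / 40 = 0.5; the estimate should land nearby.
```

Each signature has a fixed length (`num_perm` integers) regardless of set size, which is what makes comparing very large sets cheap. More positions reduce the estimation error at the cost of more hashing.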

Non-Metric Space Library (NMSLIB) - An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

  •    C++

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core-library does not have any third-party dependencies. It has been gaining popularity recently. In particular, it has become a part of Amazon Elasticsearch Service.

scanns - A scalable nearest neighbor search library in Apache Spark

  •    Scala

ScANNS is a nearest neighbor search library for Apache Spark originally developed by Namit Katariya from the LinkedIn Machine Learning Algorithms team. It enables nearest neighbor search in a batch offline context within the cosine, Jaccard and Euclidean distance spaces. This library has been tested to scale to hundreds of millions to low billions of data points.

TarsosLSH - A Java library implementing practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time

  •    Java

TarsosLSH is a Java library implementing sub-linear nearest neighbour search algorithms. It contains both an approximate and an exact search algorithm. The first, Locality-sensitive Hashing (LSH), is a randomized approximate search algorithm for a number of search spaces; it is a practical nearest neighbour search algorithm for multidimensional vectors that operates in sublinear time. The second, multi-index hashing, is an exact nearest neighbour search algorithm that is limited to Hamming space. The library supports several LSH families: the Euclidean hash family (L2), the city block hash family (L1) and the cosine hash family. The library tries to hit the sweet spot between being capable enough to get real tasks done and compact enough to serve as a demonstration of how LSH works.
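As an illustration of what a cosine hash family does, the following is a minimal Python sketch of the standard random-hyperplane construction. It is not TarsosLSH code (that library is written in Java), and the names `make_hyperplanes` and `cosine_hash` are hypothetical. Each bit records on which side of a random hyperplane the vector falls, so vectors separated by a small angle tend to share most bits.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw `num_bits` random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def cosine_hash(vector, hyperplanes):
    """Return one bit per hyperplane: 1 if the dot product is non-negative."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


planes = make_hyperplanes(dim=3, num_bits=16, seed=42)
v = [1.0, 2.0, 3.0]
scaled = [2.0 * x for x in v]      # same direction, different length
opposite = [-x for x in v]         # opposite direction

hash_v = cosine_hash(v, planes)
hash_scaled = cosine_hash(scaled, planes)
hash_opposite = cosine_hash(opposite, planes)
```

Because only the sign of each projection is kept, the hash depends on direction and not on vector length, which is why this family suits cosine distance.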




lsh - Locality Sensitive Hashing for Go (Multi-probe LSH, LSH Forest, basic LSH)

  •    Go

This library includes various Locality Sensitive Hashing (LSH) algorithms for the approximate nearest neighbour search problem in L2 metric space. The family of LSH functions for L2 is the work of Mayur Datar et al.
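For readers unfamiliar with the L2 hash family of Datar et al., here is a minimal Python sketch of a single hash function of that form: project the vector onto a random Gaussian direction, add a random offset, and divide the line into buckets of width w. This is an illustration only, not the Go library's code. The names `make_l2_hash` and `collision_rate`, and the chosen bucket width, are assumptions.

```python
import math
import random


def make_l2_hash(dim, w, rng):
    """Build h(v) = floor((a . v + b) / w) with Gaussian a and an offset b in 0..w."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


def collision_rate(u, v, hashes):
    """Fraction of hash functions that place u and v in the same bucket."""
    return sum(1 for h in hashes if h(u) == h(v)) / len(hashes)


rng = random.Random(7)
hashes = [make_l2_hash(dim=3, w=4.0, rng=rng) for _ in range(500)]

base = [0.0, 0.0, 0.0]
near = [0.1, 0.0, 0.0]    # Euclidean distance 0.1 from base
far = [100.0, 0.0, 0.0]   # Euclidean distance 100 from base

near_rate = collision_rate(base, near, hashes)
far_rate = collision_rate(base, far, hashes)
```

Points that are close in L2 distance fall into the same bucket far more often than distant points. An index typically concatenates several such functions and uses multiple tables to tune that gap.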

lshensemble - LSH index for approximate set containment search

  •    Go

Presentation slides @ VLDB 2016, New Delhi. We used two datasets for evaluation. Both datasets are in the public domain and can be downloaded directly from the original publisher.

minhash-lsh - Minhash LSH in Golang

  •    Go

If the parameter firstItemIsID is set to true, the first item is the unique ID of the set.

minhashcuda - Weighted MinHash implementation on CUDA (multi-gpu).

  •    C++

This project is a reimplementation of the Weighted MinHash calculation from ekzhu/datasketch in NVIDIA CUDA, and it brings a 600-1000x speedup over NumPy with MKL (Titan X 2016 vs 12-core Xeon E5-1650). It supports running on multiple GPUs to be even faster, e.g., processing a 10Mx12M matrix with sparsity 0.0014 takes 40 minutes using two Titan Xs. The produced results are bit-to-bit identical to the reference implementation. Read the article. The input format is a 32-bit float CSR matrix. The code is optimized for low memory consumption and speed.


consimilo - A Clojure library for querying large data-sets on similarity

  •    Clojure

consimilo is a library that uses locality sensitive hashing (implemented as lsh-forest) and minhashing to support top-k similar item queries. Finding similar items across expansive data-sets is a common problem in many real-world applications (e.g. finding articles from the same source, plagiarism detection, collaborative filtering, context filtering, document similarity, etc.). Searching a corpus for top-k similar items quickly grows to an unwieldy complexity at relatively small corpus sizes (n choose 2). LSH reduces the search space by "hashing" items in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed, the lsh-forest supports a top-k most similar items query in roughly O(log n) time. There is an accuracy trade-off that comes with the enormous increase in query speed. More information can be found in chapter 3 of Mining Massive Datasets. You can continue to add to this forest by passing it as the first argument to add-all-to-forest. The forest data structure is stored in an atom, so the existing forest is modified in place.
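To make the (n choose 2) cost and the accuracy trade-off concrete, here is a small Python sketch. It uses the banding analysis from chapter 3 of Mining Massive Datasets, where signatures are split into b bands of r rows and two items with similarity s become a candidate pair with probability 1 - (1 - s^r)^b. Note that this is the banded LSH scheme from the book, not the lsh-forest structure consimilo uses. The function names and the choice of r = 5, b = 20 are assumptions for illustration.

```python
def pair_count(n):
    """Number of pairwise comparisons in a brute-force search: n choose 2."""
    return n * (n - 1) // 2


def candidate_probability(s, rows, bands):
    """Probability that two items with similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


# Brute force grows quadratically with corpus size.
comparisons = {n: pair_count(n) for n in (1000, 100000)}

# With 20 bands of 5 rows, similar pairs are almost always candidates,
# while dissimilar pairs rarely are.
high = candidate_probability(0.8, rows=5, bands=20)
low = candidate_probability(0.2, rows=5, bands=20)
```

The gap between `high` and `low` is the source of the speedup: only candidate pairs are compared in full. A `high` value below 1 reflects the accuracy trade-off, since some genuinely similar pairs are missed.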

groot - A resistome profiler for Graphing Resistance Out Of meTagenomes

  •    Go

GROOT is a tool to type Antibiotic Resistance Genes (ARGs) in metagenomic samples (a.k.a. Resistome Profiling). It combines variation graph representation of gene sets with an LSH indexing scheme to allow for fast classification of metagenomic reads. Subsequent hierarchical local alignment of classified reads against graph traversals facilitates accurate reconstruction of full-length gene sequences using a simple scoring scheme. GROOT will output an ARG alignment file (in BAM format) that contains the graph traversals possible for each query read; the alignment file is then used by GROOT to generate a resistome profile.

likelike - An implementation of locality sensitive hashing with Hadoop

  •    Java

Likelike is an implementation of LSH (locality sensitive hashing) on Hadoop. This program can be used for nearest neighbor extraction or item recommendation on e-commerce sites. Currently Likelike supports only min-wise independent permutations, which have been applied to recommendation in Google News (Das et al. 2007). Begin with the Likelike quick start page (QuickStart), which provides information on installation and a tutorial with small input files. For detailed usage, please visit the Usage page.