
SetSimilaritySearch - All-pair set similarity search on millions of sets in Python and on a laptop (faster than MinHash LSH)

  •    Python

Efficient set similarity search algorithms in Python. For even better performance see the Go Implementation. A popular way to measure the similarity between two sets is Jaccard similarity, which gives a fractional score between 0 and 1.0.
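
As a quick illustration of the Jaccard score mentioned above, here is a minimal sketch in plain Python. It does not use the SetSimilaritySearch library; the function name `jaccard` and the convention for two empty sets are assumptions made for this example.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two shared elements out of four distinct elements overall.
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```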

Non-Metric Space Library (NMSLIB) - An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

  •    C++

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluating similarity search methods. The core library does not have any third-party dependencies. It has been gaining popularity recently; in particular, it has become a part of Amazon Elasticsearch Service.

apollo - Advanced similarity and duplicate source code proof of concept for our research efforts.

  •    Python

Advanced code deduplicator from hell. Powered by source{d} ML, source{d} engine and minhashcuda. Agnostic to the analysed language thanks to Babelfish. Python 3, PySpark, CUDA inside. source{d}'s effort to research and solve the code deduplication problem. At scale, as usual. A code clone is a set of several code snippets with few differences between them. For now this project focuses on finding near-duplicate projects and files; it will eventually support functions and snippets.

visualsearch - Visual Search is a little app to find and cluster similar images using Tagbox

  •    Go

Checking similarity between images is done using Tagbox. Make sure you have it running on http://localhost:8080. You will need MB_KEY to run it; visit https://machinebox.io to get it. Once you have Tagbox running you can do.

consimilo - A Clojure library for querying large data-sets on similarity

  •    Clojure

consimilo is a library that uses locality-sensitive hashing (implemented as lsh-forest) and minhashing to support top-k similar item queries. Finding similar items across expansive data-sets is a common problem in many real-world applications (e.g. finding articles from the same source, plagiarism detection, collaborative filtering, context filtering, document similarity, etc.). Comparing every pair of items in a corpus takes (n choose 2) comparisons, which becomes unwieldy even at relatively small corpus sizes. LSH reduces the search space by "hashing" items in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed, the lsh-forest supports top-k most-similar-item queries in roughly O(log n) time. This large increase in query speed comes with an accuracy trade-off. More information can be found in chapter 3 of Mining Massive Datasets. You can continue to add to this forest by passing it as the first argument to add-all-to-forest. The forest data structure is stored in an atom, so the existing forest is modified in place.
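
To make the idea of "hashing so that collisions occur as a result of similarity" concrete, here is a minimal Python sketch of MinHash signatures with simple banded LSH. It is not consimilo's API and not an LSH Forest: plain banding is swapped in because it is shorter to show. All names (`minhash_signature`, `build_index`, `candidates`) and the parameters `num_perm=32` and `bands=8` are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def _hash(token, seed):
    """64-bit hash of a string token, varied by an integer seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_perm=32):
    """One minimum per seeded hash function; tokens must be non-empty."""
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(num_perm))


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def build_index(signatures, bands=8):
    """Bucket each key once per band; equal band slices land in the same bucket."""
    index = defaultdict(set)
    for key, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            index[(b, sig[b * rows:(b + 1) * rows])].add(key)
    return index


def candidates(index, sig, bands=8):
    """Return every key that shares at least one band bucket with sig."""
    rows = len(sig) // bands
    found = set()
    for b in range(bands):
        found |= index.get((b, sig[b * rows:(b + 1) * rows]), set())
    return found


docs = {
    "a": {"the", "quick", "brown", "fox"},
    "b": {"the", "quick", "brown", "fox"},
    "c": {"lorem", "ipsum", "dolor"},
}
sigs = {key: minhash_signature(tokens) for key, tokens in docs.items()}
index = build_index(sigs)
print(candidates(index, sigs["a"]))
print(estimate_jaccard(sigs["a"], sigs["b"]))
```

Only the keys returned by `candidates` need an exact comparison, which is where the reduction in search space comes from. Using more rows per band makes collisions rarer and stricter; using more bands makes them more likely.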

FAST - End-to-end earthquake detection pipeline via efficient time series similarity search

  •    Shell

The following instructions were only tested on Linux clusters; we do not currently support other operating systems. To efficiently process inputs spanning a long duration, we suggest running the pipeline on a server with multiple processes and sufficient memory. Raw SAC files for each station are stored under data/waveforms${STATION}. Station "HEC" has 3 components so it should have 3 time series data files; the other stations have only 1 component.

TopSim - Efficiently search the most similar strings against the query in Python.

  •    Python

Search for the strings most similar to a query in Python 3. State-of-the-art algorithms and data structures are adopted for best efficiency. For both flexibility and efficiency, only set-based similarities are included right now, including Jaccard and Tversky.
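
For readers unfamiliar with the Tversky index, the sketch below shows the standard formula, |X ∩ Y| / (|X ∩ Y| + α|X − Y| + β|Y − X|), together with a brute-force top-k search. It is not TopSim's API or algorithm: the character-bigram tokenization, the linear scan, and the names `bigrams`, `tversky` and `top_k` are assumptions made for this example.

```python
import heapq


def bigrams(text):
    """Set of character bigrams (an assumed tokenization for this sketch)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def tversky(x, y, alpha=1.0, beta=1.0):
    """Tversky index; alpha = beta = 1 gives Jaccard, alpha = beta = 0.5 gives Dice."""
    common = len(x & y)
    denom = common + alpha * len(x - y) + beta * len(y - x)
    if denom == 0:
        # Degenerate case (e.g. two empty sets); convention chosen for this sketch.
        return 1.0 if x == y else 0.0
    return common / denom


def top_k(query, strings, k=2, alpha=1.0, beta=1.0):
    """Brute-force scan: score every string and keep the k best."""
    q = bigrams(query)
    scored = ((tversky(q, bigrams(s), alpha, beta), s) for s in strings)
    return heapq.nlargest(k, scored)


print(top_k("night", ["night", "nacht", "light"]))
# [(1.0, 'night'), (0.6, 'light')]
```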

imgsmlr - Similar images search for PostgreSQL

  •    C

ImgSmlr is a PostgreSQL extension that implements similar-image search. The ImgSmlr method is based on the Haar wavelet transform. The goal of ImgSmlr is not to provide the most advanced, state-of-the-art similar-image search methods. ImgSmlr was written as a sample extension that illustrates how PostgreSQL extensibility can cover tasks that are atypical for an RDBMS, such as similar-image search.
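
As background on the Haar wavelet transform, here is a minimal one-dimensional sketch in Python. It is not ImgSmlr's implementation, which works on images inside PostgreSQL. The unnormalized averaging variant and the function name `haar_1d` are assumptions made for illustration.

```python
def haar_1d(values):
    """Full 1D Haar decomposition using pairwise averages and half-differences.

    Returns [overall average, coarsest detail, ..., finest details].
    The input length must be a power of two.
    """
    data = [float(v) for v in values]
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    while len(data) > 1:
        pairs = list(zip(data[0::2], data[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        data = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones collected so far.
        details = level_details + details
    return data + details


print(haar_1d([4, 6, 10, 12]))  # [8.0, -3.0, -1.0, -1.0]
```

The leading coefficients summarize the coarse shape of the signal, which is why a handful of wavelet coefficients can serve as a compact signature for comparing similar inputs.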

go-set-similarity-search - Efficient set similarity search algorithms implemented in Go

  •    Go

This is a mirror implementation of the Python SetSimilaritySearch library in Go, with better performance. The benchmark runs the AllPairs algorithm on a 3.5 GHz Intel Core i7, using the Jaccard similarity function and a similarity threshold of 0.5.
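
To show what an all-pairs search with a threshold of 0.5 returns, here is a brute-force baseline sketched in Python. It is not the library's AllPairs algorithm or API. The function name `all_pairs_bruteforce` is hypothetical, and the only optimization is a size filter, which is safe for Jaccard because the score can never exceed the ratio of the smaller set size to the larger one.

```python
from itertools import combinations


def all_pairs_bruteforce(sets, threshold=0.5):
    """Return (i, j, jaccard) for every pair of sets scoring at least threshold."""
    results = []
    for (i, a), (j, b) in combinations(enumerate(sets), 2):
        small, large = sorted((len(a), len(b)))
        # Size filter: Jaccard(a, b) <= small / large, so skip hopeless pairs.
        # Pairs of two empty sets are skipped as well in this sketch.
        if large == 0 or small / large < threshold:
            continue
        score = len(a & b) / len(a | b)
        if score >= threshold:
            results.append((i, j, score))
    return results


sets = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 2, 3, 4, 5, 6, 7, 8}]
print(all_pairs_bruteforce(sets))  # [(0, 1, 0.5)]
```

This baseline compares a quadratic number of pairs, which is the cost that dedicated all-pairs algorithms aim to avoid on millions of sets.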