
Faiss - A library for efficient similarity search and clustering of dense vectors.

  •    C++

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.

Milvus - An open-source vector database for embedding similarity search and AI applications

  •    Go

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible, and provides a consistent user experience regardless of the deployment environment. Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility.

similarity - TensorFlow Similarity is a python package focused on making similarity learning quick and easy

  •    Python

TensorFlow Similarity is a TensorFlow library for similarity learning, also known as metric learning and contrastive learning. TensorFlow Similarity is still in beta.

SetSimilaritySearch - All-pair set similarity search on millions of sets in Python and on a laptop (faster than MinHash LSH)

  •    Python

Efficient set similarity search algorithms in Python. For even better performance see the Go Implementation. A popular way to measure the similarity between two sets is Jaccard similarity, which gives a fractional score between 0 and 1.0.
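As a rough illustration of the Jaccard score mentioned above, here is a minimal Python sketch. The function name `jaccard` and the convention that two empty sets score 1.0 are assumptions made for this example; this is not the SetSimilaritySearch API.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two of the four distinct elements are shared.
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```

A score of 0 means the sets share nothing, and 1.0 means they are equal.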

Non-Metric Space Library (NMSLIB) - An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

  •    C++

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core-library does not have any third-party dependencies. It has been gaining popularity recently. In particular, it has become a part of Amazon Elasticsearch Service.

apollo - Advanced similarity and duplicate source code proof of concept for our research efforts.

  •    Python

Advanced code deduplicator from hell. Powered by source{d} ML, source{d} engine and minhashcuda. Agnostic to the analysed language thanks to Babelfish. Python 3, PySpark, CUDA inside. source{d}'s effort to research and solve the code deduplication problem. At scale, as usual. A code clone is several snippets of code with few differences. For now this project focuses on finding near-duplicate projects and files; it will eventually support functions and snippets.

visualsearch - Visual Search is a little app to find and cluster similar images using Tagbox

  •    Go

Checking similarity between images is done using Tagbox. Make sure you have it running on http://localhost:8080. You will need an MB_KEY to run it; visit https://machinebox.io to get one. Once you have Tagbox running you can do.

consimilo - A Clojure library for querying large data-sets on similarity

  •    Clojure

consimilo is a library that utilizes locality sensitive hashing (implemented as lsh-forest) and minhashing to support top-k similar item queries. Finding similar items across expansive data-sets is a common problem that presents itself in many real world applications (e.g. finding articles from the same source, plagiarism detection, collaborative filtering, context filtering, document similarity, etc.). Searching a corpus for top-k similar items by comparing every pair of items (n choose 2 comparisons) quickly becomes unwieldy, even at relatively small corpus sizes. LSH reduces the search space by "hashing" items in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed, the lsh-forest supports a top-k most similar items query of ~O(log n). There is an accuracy trade-off that comes with the enormous increase in query speed. More information can be found in chapter 3 of Mining Massive Datasets. You can continue to add to this forest by passing it as the first argument to add-all-to-forest. The forest data structure is stored in an atom, so the existing forest is modified in place.
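The minhashing step described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not consimilo's Clojure API: the names `minhash_signature` and `estimate_jaccard`, the default of 64 hash functions, and the use of salted BLAKE2b as the hash family are all assumptions for this example, and the lsh-forest index itself is not shown.

```python
import hashlib


def minhash_signature(items, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash over the set.

    `items` must be non-empty; elements are hashed via their str() form.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(str(x).encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = minhash_signature({"apple", "banana", "cherry", "date"})
b = minhash_signature({"banana", "cherry", "date", "elderberry"})
print(estimate_jaccard(a, b))  # an estimate of the true Jaccard similarity, 3/5
```

Similar sets agree in many signature slots, which is what lets an LSH index bucket them together without comparing every pair. More hash functions give a tighter estimate at the cost of larger signatures.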

FAST - End-to-end earthquake detection pipeline via efficient time series similarity search

  •    Shell

The following instructions were only tested on Linux clusters; we do not currently support other operating systems. To efficiently process inputs spanning a long duration, we suggest running the pipeline on a server with multiple processes and sufficient memory. Raw SAC files for each station are stored under data/waveforms${STATION}. Station "HEC" has 3 components so it should have 3 time series data files; the other stations have only 1 component.

TopSim - Efficiently search the most similar strings against the query in Python.

  •    Python

Search for the strings most similar to a query in Python 3. State-of-the-art algorithms and data structures are adopted for best efficiency. For both flexibility and efficiency, only set-based similarities are included right now, including Jaccard and Tversky.
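For readers unfamiliar with the Tversky index, the sketch below shows the standard formula, |A ∩ B| / (|A ∩ B| + alpha·|A − B| + beta·|B − A|). It is not TopSim's API: the function name `tversky`, the parameter names `alpha` and `beta`, their defaults, and the choice to compare strings as sets of characters are assumptions for this example.

```python
def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index of two iterables treated as sets.

    alpha weights elements found only in `a`, beta weights elements found only in `b`.
    alpha = beta = 1 reduces to Jaccard; alpha = beta = 0.5 reduces to Dice.
    """
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    # Assumed convention: if the denominator is zero (e.g. both sets empty), return 1.0.
    return common / denom if denom else 1.0


# Strings compared as sets of characters (an assumption for this example).
print(tversky("night", "nacht", alpha=1, beta=1))  # Jaccard-equivalent score
```

Unequal weights make the measure asymmetric, which is useful when a query only needs to be contained in a candidate rather than match it exactly.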

go-set-similarity-search - Efficient set similarity search algorithms implemented in Go

  •    Go

This is a mirror implementation of the Python SetSimilaritySearch library in Go, with better performance. Run AllPairs algorithm on 3.5 GHz Intel Core i7, using similarity function jaccard and similarity threshold 0.5.

imgsmlr - Similar images search for PostgreSQL

  •    C

ImgSmlr is a PostgreSQL extension that implements similar-image search. The ImgSmlr method is based on the Haar wavelet transform. The goal of ImgSmlr is not to provide the most advanced state-of-the-art image similarity search methods. ImgSmlr was written as a sample extension that illustrates how PostgreSQL extensibility can cover tasks as atypical for an RDBMS as similar-image search.

anndb - Distributed Approximate Nearest Neighbors Database https://anndb.com

  •    Go

AnnDB is a horizontally scalable and distributed approximate nearest neighbors database. It is built from the ground up to scale to millions of high-dimensional vectors while providing low latency and high throughput. AnnDB uses a custom implementation of HNSW [1] to make search in high-dimensional vector spaces fast. It splits each dataset and its underlying index into partitions. Partitions are distributed and replicated across nodes in the cluster using the Raft protocol [2] to ensure high availability and data durability in case of node failures. Search is performed in a map-reduce-like fashion. The node that receives a search request from the client samples a node for each partition and sends a partition search request to that node. Each of these nodes then searches the requested partitions and aggregates the results locally before sending them to the driver node, which re-aggregates the responses from all partitions and sends the result to the client.
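The map-reduce-like search flow described above can be sketched in a single process. This is a simplified illustration, not AnnDB's code: partitions are modeled as plain dictionaries, an exact linear scan stands in for the per-partition HNSW search, and the names `search_partition` and `search` are hypothetical. Networking, replication, and node sampling are left out.

```python
import heapq
import math


def search_partition(partition, query, k):
    """Exact top-k inside one partition (stand-in for a per-partition HNSW search)."""
    scored = ((math.dist(vec, query), vec_id) for vec_id, vec in partition.items())
    return heapq.nsmallest(k, scored)  # sorted list of (distance, id) pairs


def search(partitions, query, k):
    """Driver step: query every partition, then re-aggregate into a global top-k."""
    partials = [search_partition(p, query, k) for p in partitions]
    # Each partial result is already sorted, so a k-way merge is sufficient.
    return heapq.nsmallest(k, heapq.merge(*partials))


partitions = [
    {1: (0.0, 0.0), 2: (5.0, 5.0)},
    {3: (1.0, 0.0), 4: (9.0, 9.0)},
]
print(search(partitions, (0.0, 0.0), k=2))  # [(0.0, 1), (1.0, 3)]
```

Because each partition returns its own k best candidates, the driver only has to merge a small number of short sorted lists, regardless of how many vectors each partition holds.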

dhash-vips - Ruby gem to measure images similarity

  •    Ruby

dHash is an image fingerprinting algorithm that can be used to measure the similarity of two images. IDHash is a new algorithm that has some improvements over dHash -- I'll describe it further. You can read about dHash and perceptual hashing in the article "Kind of Like That" at "The Hacker Factor Blog" (21 January 2013). The idea is that you resize the original image to 8x9 and then convert it to an 8x8 array of bits -- each bit tells whether the corresponding segment of the image is brighter or darker than the one on the right (or left). Then you apply the Hamming distance to such arrays to measure how different they are.
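The dHash idea described above can be sketched in Python without any image library. This is only an illustration of the technique, not the gem's Ruby API and not IDHash: it assumes the image has already been resized and converted to grayscale, given as 8 rows of 9 brightness values, and the names `dhash_bits` and `hamming` are hypothetical.

```python
def dhash_bits(gray):
    """Build a 64-bit fingerprint from 8 rows of 9 brightness values.

    Each bit is 1 when a pixel is brighter than its right-hand neighbour.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):  # 8 neighbour comparisons per row
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# A left-to-right gradient and its mirror image differ in every bit.
ascending = [list(range(9)) for _ in range(8)]
descending = [row[::-1] for row in ascending]
print(hamming(dhash_bits(ascending), dhash_bits(descending)))  # 64
```

A small Hamming distance suggests near-duplicate images; the threshold that counts as "similar" depends on the data and is not fixed by the algorithm.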

awesome-metric-learning - 😎 A curated list of awesome practical Metric Learning and its applications


At Qdrant, we have one goal: make metric learning more practical. This listing is in line with this purpose, and we aim at providing a concise yet useful list of awesomeness around metric learning. It is intended to be inspirational for productivity rather than serve as a full bibliography. If you find it useful or like it in some other way, you may want to join our Discord server, where we are running a paper reading club on metric learning.
