Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.
clustering similarity-search artificial-intelligence gpu

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible, and provides a consistent user experience regardless of the deployment environment. Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility.
database ai vector nearest-neighbor-search cloud-native image-search approximate-nearest-neighbor-search hacktoberfest embedding similarity-search video-search faiss anns hnsw vector-search milvus vector-database embeddings-similarity artificial-intelligence

TensorFlow Similarity is a TensorFlow library for similarity learning, also known as metric learning and contrastive learning. TensorFlow Similarity is still in beta.
deep-learning tensorflow nearest-neighbor-search metric-learning nearest-neighbors similarity-search similarity-learning contrastive-learning

Efficient set similarity search algorithms in Python. For even better performance see the Go Implementation. A popular way to measure the similarity between two sets is Jaccard similarity, which gives a fractional score between 0 and 1.0.
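The Jaccard score mentioned above can be sketched in a few lines. This is a minimal brute-force illustration, not the library's optimized algorithm; the function names `jaccard` and `all_pairs` are hypothetical and chosen only for this example.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(a | b)


def all_pairs(sets, threshold):
    """Brute-force all-pairs search: yield (i, j, similarity) for pairs at or above threshold."""
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            sim = jaccard(sets[i], sets[j])
            if sim >= threshold:
                yield i, j, sim
```

The brute-force loop compares every pair, so it is only practical for small collections; dedicated set similarity search libraries exist to avoid exactly that quadratic cost.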
similarity-search set-similarity-search all-pairs

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core library does not have any third-party dependencies. It has been gaining popularity recently. In particular, it has become a part of Amazon Elasticsearch Service.
search search-library similarity-search algorithm knn-search non-metric neighborhood-graphs k-nn-graphs proximity-graphs lsh locality-sensitive-hashing

Advanced code deduplicator from hell. Powered by source{d} ML, source{d} engine and minhashcuda. Agnostic to the analysed language thanks to Babelfish. Python 3, PySpark, CUDA inside. source{d}'s effort to research and solve the code deduplication problem. At scale, as usual. A code clone is several snippets of code with few differences. For now this project focuses on finding near-duplicate projects and files; it will eventually support functions and snippets.
duplicates similarity source-code duplicate-detection similarity-search

Checking similarity between images is done using Tagbox. Make sure you have it running on http://localhost:8080. You will need MB_KEY to run it; visit https://machinebox.io to get it. Once you have Tagbox running you can do.
image-recognition similarity-search machinebox

consimilo is a library that uses locality-sensitive hashing (implemented as lsh-forest) and minhashing to support top-k similar item queries. Finding similar items across expansive datasets is a common problem that presents itself in many real-world applications (e.g. finding articles from the same source, plagiarism detection, collaborative filtering, context filtering, document similarity, etc.). Searching a corpus for top-k similar items quickly grows to an unwieldy complexity at relatively small corpus sizes (n choose 2). LSH reduces the search space by "hashing" items in such a way that collisions occur as a result of similarity. Once the items are hashed and indexed, the lsh-forest supports a top-k most similar items query of ~O(log n). There is an accuracy trade-off that comes with the enormous increase in query speed. More information can be found in chapter 3 of Mining Massive Datasets. You can continue to add to this forest by passing it as the first argument to add-all-to-forest. The forest data structure is stored in an atom, so the existing forest is modified in place.
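The idea that similar items collide under hashing can be illustrated with a small MinHash sketch. This is not consimilo's code (consimilo is a Clojure library built on an LSH forest); it is a Python illustration that swaps in plain banding instead of a forest, and the names `minhash_signature`, `estimate_jaccard`, and `band_keys` are hypothetical.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """Return a MinHash signature: the minimum hash of a non-empty item set under each seed."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(str(item).encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(signature, bands=16):
    """Split a signature into bands; items sharing any band key become candidate pairs."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}
```

Two items need to be compared in full only when their band key sets intersect, which is how hashing shrinks the search space at the cost of occasionally missing a similar pair.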
minhash-lsh-algorithm minhash lsh lsh-forest data-sketching data-sketches similarity similarity-search jaccard-similarity cosine-distance hamming-distance plagiarism-detection recommender-system collaborative-filtering document-similarity

The following instructions were only tested on Linux clusters; we do not currently support other operating systems. To efficiently process inputs spanning a long duration, we suggest running the pipeline on a server with multiple processes and sufficient memory. Raw SAC files for each station are stored under data/waveforms${STATION}. Station "HEC" has 3 components so it should have 3 time series data files; the other stations have only 1 component.
minhash-lsh-algorithm time-series earthquakes similarity-search

Search for the strings most similar to a query in Python 3. State-of-the-art algorithms and data structures are adopted for best efficiency. For both flexibility and efficiency, only set-based similarities are included right now, including Jaccard and Tversky.
similarity-search string-search

This is a mirror implementation of the Python SetSimilaritySearch library in Go, with better performance. Run AllPairs algorithm on 3.5 GHz Intel Core i7, using similarity function jaccard and similarity threshold 0.5.
set-similarity-search all-pairs similarity-search

ImgSmlr is a PostgreSQL extension that implements similar-image search. The ImgSmlr method is based on the Haar wavelet transform. The goal of ImgSmlr is not to provide the most advanced state-of-the-art similar-image search methods. ImgSmlr was written as a sample extension that illustrates how PostgreSQL extensibility can cover tasks as atypical for an RDBMS as similar-image search.
postgresql gist image-processing similarity-search postgres

AnnDB is a horizontally scalable and distributed approximate nearest neighbors database. It is built from the ground up to scale to millions of high-dimensional vectors while providing low latency and high throughput. AnnDB uses a custom implementation of HNSW [1] to make search in high-dimensional vector spaces fast. It splits each dataset and its underlying index into partitions. Partitions are distributed and replicated across nodes in the cluster using the Raft protocol [2] to ensure high availability and data durability in case of node failures. Search is performed in a map-reduce-like fashion. The node that receives a search request from the client (the driver node) samples a node for each partition and sends a partition search request to that node. Each of these nodes then searches the requested partitions and aggregates the results locally before sending them to the driver node, which re-aggregates the responses from all partitions and sends the result to the client.
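The map-reduce-like search described above can be sketched as a scatter-gather over partitions. This is a single-process illustration under simplifying assumptions, not AnnDB's implementation: there is no networking, replication, or Raft, each partition is a plain dict, and an exact scan stands in for HNSW. The names `search_partition` and `search` are hypothetical.

```python
import heapq


def search_partition(partition, query, k):
    """Top-k within one partition by squared Euclidean distance (exact scan as a stand-in for HNSW)."""
    scored = (
        (sum((x - q) ** 2 for x, q in zip(vector, query)), vector_id)
        for vector_id, vector in partition.items()
    )
    return heapq.nsmallest(k, scored)


def search(partitions, query, k):
    """Driver step: query every partition, then re-aggregate the partial top-k lists."""
    partials = [search_partition(partition, query, k) for partition in partitions]
    return heapq.nsmallest(k, (hit for partial in partials for hit in partial))
```

Because every partition returns its own k best hits, the merged result is the global top-k for this exact-scan stand-in; with an approximate index per partition the merged result would be approximate as well.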
raft distributed-database approximate-nearest-neighbor-search similarity-search hnsw

dHash is an image fingerprinting algorithm that can be used to measure the similarity of two images. IDHash is a new algorithm that has some improvements over dHash; I'll describe it further. You can read about dHash and perceptual hashing in the article "Kind of Like That" at "The Hacker Factor Blog" (21 January 2013). The idea is that you resize the original image to 8x9 and then convert it to an 8x8 array of bits, where each bit tells whether the corresponding segment of the image is brighter or darker than the one on the right (or left). Then you apply the Hamming distance to such arrays to measure how different they are.
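The plain dHash step described above can be sketched as follows. This covers only the basic dHash, not the IDHash improvements, and it is a Python illustration rather than the gem's Ruby code. It assumes the image has already been converted to grayscale and resized to 8 rows of 9 brightness values; the names `dhash_bits` and `hamming` are hypothetical.

```python
def dhash_bits(gray):
    """Build a 64-bit fingerprint from 8 rows of 9 brightness values.

    Each bit is 1 when a pixel is brighter than its right-hand neighbour.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between two fingerprints suggests visually similar images; the threshold to use is a tuning choice that depends on the data.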
rubygem fingerprint fingerprints similarity-measures image-comparison perceptual-hashing similarity-search dhash

At Qdrant, we have one goal: make metric learning more practical. This listing is in line with this purpose, and we aim at providing a concise yet useful list of awesomeness around metric learning. It is intended to be inspirational for productivity rather than serve as a full bibliography. If you find it useful or like it in some other way, you may want to join our Discord server, where we are running a paper reading club on metric learning.
tutorials survey recommendation-system awesome-list metric-learning semantic-similarity similarity-search anomaly-detection