NGT - Neighborhood Graph and Tree for Indexing High-dimensional Data

  •        24

NGT provides commands and a library for performing high-speed approximate nearest neighbor searches against a large volume of data (several million to several 10 million items of data) in high dimensional vector data space (several ten to several thousand dimensions). The index can be placed in shared memory. Using shared memory can reduce the amount of memory needed when multiple processes are using the same index. It can also improve the boot-up speed of an index for a large volume of registration data. Since changes become necessary at build time, please add the following parameter when executing "cmake" in order to use shared memory.

https://github.com/yahoojapan/NGT

Tags
Implementation
License
Platform

   




Related Projects

Sphinix - Search server

  •    C++

Sphinix is free open-source SQL full-text search engine. How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.

Xapian - Search Engine Library

  •    C++

Xapian is an Open Source Search Engine Library. It is written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C# and Ruby. Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.

lopq - Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark

  •    Python

This is Python training and testing code for Locally Optimized Product Quantization (LOPQ) models, as well as Spark scripts to scale training to hundreds of millions of vectors. The resulting model can be used in Python with code provided here or deployed via a Protobuf format to, e.g., search backends for high performance approximate nearest neighbor search.Locally Optimized Product Quantization (LOPQ) [1] is a hierarchical quantization algorithm that produces codes of configurable length for data points. These codes are efficient representations of the original vector and can be used in a variety of ways depending on the application, including as hashes that preserve locality, as a compressed vector from which an approximate vector in the data space can be reconstructed, and as a representation from which to compute an approximation of the Euclidean distance between points.

Rank KNN

  •    

A variation of KNN, classifying data using the aggregate ranking of proximity to each data column - thus no n-dimensional normalised distance function is needed

Constellio - Enterprise Search engine

  •    Java

Constellio Open Source Enterprise Search is based on Apache Solr and using Google Search Appliances connectors architecture, it allows, with a single click, to find all relevant content in your organization (Web, email, ECM, CRM etc.).


SwiftGraph - A Graph Data Structure in Pure Swift

  •    Swift

SwiftGraph is a pure Swift (no Cocoa) implementation of a graph data structure, appropriate for use on all platforms Swift supports (iOS, macOS, Linux, etc.). It includes support for weighted, unweighted, directed, and undirected graphs. It uses generics to abstract away both the type of the vertices, and the type of the weights. It includes copious in-source documentation, unit tests, as well as search functions for doing things like breadth-first search, depth-first search, and Dijkstra's algorithm. Further, it includes utility functions for topological sort, Jarnik's algorithm to find a minimum-spanning tree, detecting a DAG (directed-acyclic-graph), and enumerating all cycles.

arxiv-sanity-preserver - Web interface for browsing, search and filtering recent arxiv submissions

  •    Python

This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, to add papers to a personal library, and to get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website to any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py. Indexing code. Uses Arxiv API to download the most recent papers in any categories you like, and then downloads all papers, extracts all text, creates tfidf vectors based on the content of each paper. This code is therefore concerned with the backend scraping and computation: building up a database of arxiv papers, calculating content vectors, creating thumbnails, computing SVMs for people, etc.

DeepVideoAnalytics - A distributed visual search and visual data analytics platform.

  •    Python

Deep Video Analytics is a platform for indexing and extracting information from videos and images. With latest version of docker installed correctly, you can run Deep Video Analytics in minutes locally (even without a GPU) using a single command. Deep Video Analytics implements a client-server architecture pattern, where clients can access state of the server via a REST API. For uploading, processing data, training models, performing queries, i.e. mutating the state clients can send DVAPQL (Deep Video Analytics Processing and Query Language) formatted as JSON. The query represents a directed acyclic graph of operations.

RediSearch - FullText Search module for redis

  •    C

Redisearch implements a search engine on top of Redis, but unlike other redis search libraries, it does not use internal data structures like sorted sets. Inverted indexes are stored as a special compressed data type that allows for fast indexing and search speed, and low memory footprint.

Solr - Blazing-fast, open source enterprise search platform

  •    Java

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

ElasticSearch - Distributed, RESTful search and analytics engine

  •    Java

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

Arachnode.net

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

Terrier - Information Retrieval Platform

  •    Java

Terrier is a highly flexible, efficient, and effective open source search engine, readily deployable on large-scale collections of documents. Terrier implements state-of-the-art indexing and retrieval functionalities, and provides an ideal platform for the rapid development and evaluation of large-scale retrieval applications. Terrier can index large corpora of documents, and provides multiple indexing strategies, such as multi-pass, single-pass and large-scale MapReduce indexing.

faiss - A library for efficient similarity search and clustering of dense vectors.

  •    C++

UPDATE: As of July 30 2017, the license on Faiss was relaxed to BSD from CC-BY-NC. Read LICENSE for details. Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.

SoftwareBotany.Sunlight Word Aligned Hybrid Bit Vector Search Framework

  •    

The Software Botany Sunlight project is a search framework built using Word Aligned Hybrid Bit Vectors. Its sole purpose is to provide high performance in-memory searching of data using unknown combinations of indices. It is developed with .NET 4.0 using C#.

Whoosh - Python Search Library

  •    Python

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. It has support of Fielded indexing, search, scoring, text analysis, storage, Pluggable scoring algorithm, Powerful query language and spell-checker.

LargeVis

  •    C++

#LargeVis This is the official implementation of the LargeVis model by the original authors, which is used to visualize large-scale and high-dimensional data (Tang, Liu, Zhang and Mei). It now supports visualizing both high-dimensional feature vectors and networks. The package also contains a very efficient algorithm for constructing K-nearest neighbor graph (K-NNG). Contact person: Jian Tang, tangjianpku@gmail.com. This work is done when the author is in Microsoft Research Asia.

Hydra - Distributed processing framework for search solutions

  •    Java

Hydra is designed to give the search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. This is done by providing a scalable and efficient pipeline which the documents will have to pass through before being indexed into the search engine. Architecturally Hydra sits in between the search engine and the source integration.

Zebra - Indexing and Retrieval Engine

  •    searchengine

Zebra is a high-performance, general-purpose structured text indexing and retrieval engine. It can index records in XML/SGML, MARC, e-mail archives and many other formats and allows access to them through exact boolean search expressions and relevance-ranked free-text queries.

NearPy - Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes

  •    Python

NearPy is a Python framework for fast (approximated) nearest neighbour search in high dimensional vector spaces using different locality-sensitive hashing methods. It allows to experiment and to evaluate new methods but is also production-ready. It comes with a redis storage adapter.