Vespa - Yahoo's big data serving engine

  •        197

Vespa is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time. Vespa is serving platform for Yahoo.com, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Gemini, Flickr.

Queries can use both structured filters and unstructured text search to select data. All the matching data is then ranked according to a ranking function - typically machine learned - to implement such use cases as search relevance, recommendation, targeting and personalization.

Vespa is scalable. System sizes up to hundreds of nodes handling tens of billions of documents are not uncommon, and no harder to set up and modify than single node systems. Since all system components, as well as stored data is redundant and self-correcting, hardware failures are not operational emergencies and can be handled by re-adding capacity when convenient.

Its feature include:

  • Text search - Combine structured query and text search to select data
  • Advanced Ranking - Machine learned ranking
  • Aggregation
  • Elastic, Scalable
  • High Availability
  • Auto repair data corruption
  • Simple HTTP API interface 
  • lot more...

http://vespa.ai/
https://github.com/vespa-engine/vespa

Tags
Implementation
License
Platform

   




Related Projects

ElasticSearch - Distributed, RESTful search and analytics engine


Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected.

compass - Searchengine built on top of Lucene


Compass is a real time searchengine. It is built on top of lucene. It is transactional, distributed, supports Spring MVC, integrates with Hibernate.

Manticore Search - High performance full-text search engine with SQL and JSON support


Manticore Search is an open source high performance full-text search oriented engine. It is a fork of Sphinx Search. Manticore Search is written in C++. It means speed and low resource consumption, it means you don’t have to worry about a garbage collector that suddenly makes a trouble.

H2O - Fast Scalable Machine Learning API For Smarter Applications


H2O is for data scientists and application developers who need fast, in-memory scalable machine learning for smarter applications. H2O is an open source parallel processing engine for machine learning. Unlike traditional analytics tools, H2O provides a combination of extraordinary math, a high performance parallel architecture, and unrivaled ease of use.

Sensorbee - Lightweight stream processing engine for IoT


Sensorbee is designed for low-latency processing of streaming data at the edge of the network. IoT devices frequently generate large volumes of unstructured streaming data, such as video and audio streams. Even if the data streams are structured, they may be meaningless if their temporal characteristics are not considered. Cloud-based services are generally not good at processing these kinds of data. Preprocessing data streams before they are sent to the cloud makes large scale data processing in the cloud more efficient and reduces the usage of network bandwidth.


Katta - Lucene and more in the cloud.


Katta is a scalable, failure tolerant, distributed, data storage for real time access. Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and Hadoop mapfiles.

Smile - Statistical Machine Intelligence & Learning Engine


Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-art performance.Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc.

SenseiDB - Distributed, Realtime, Semi-Structured Database from LinkedIn


Sensei is a distributed data system that was built to support many product initiatives at LinkedIn, including the real-time faceted search in LinkedIn Signal and the news feed and tabs on the Homepage. Sensei is both a search engine and a database. It is designed to query and navigate through documents that consist of unstructured text and well-formed and structured metadata. Sensei is both a search engine and a database.

Apache Storm - Distributed and fault-tolerant realtime computation


Storm is a distributed real time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

Constellio - Enterprise Search engine


Constellio Open Source Enterprise Search is based on Apache Solr and using Google Search Appliances connectors architecture, it allows, with a single click, to find all relevant content in your organization (Web, email, ECM, CRM etc.).

HPCC System - Hadoop alternative


HPCC is a proven and battle-tested platform for manipulating, transforming, querying and data warehousing Big Data. It supports two type of configuration. Thor is responsible for consuming vast amounts of data, transforming, linking and indexing that data. It functions as a distributed file system with parallel processing power spread across the nodes. Roxie, the Data Delivery Engine, provides separate high-performance online query processing and data warehouse capabilities.

Apache Metron - Real-time Big Data Security


Metron integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis. Metron provides capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, while applying the most current threat intelligence information to security telemetry within a single platform.

Lucene - A high-performance, full-featured text search engine library


Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Seeks - An Open Decentralized Platform for Collaborative Search, Filtering and content Curation


Seeks acts as a personalizing Web server or proxy between you and your data feeds. Connect most search engines, RSS/ATOM feeds, Twitter / Identica, Youtube / Dailymotion, Wikis, and basically any source of data, and Seeks will produce a fused personalized batch / stream of results to your queries. Its specific purpose is to regroup users whose queries are similar so they can share both the query results and their experience on these results.

Sphinix - Search server


Sphinix is free open-source SQL full-text search engine. How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.

Summa


Summa is a fast modular and scalable search engine written in Java. Sports a flexible workflow system and incremental indexing and facetting. Summa is primarily developed by the State and University Library of Denmark.

Jumper - Collaborative search engine in PHP


Jumper 2.0 is a collaborative community search platform that revolutionizes search by crowdsourcing knowledge management powered by a shared bookmarking engine. It is easily and quickly deployed into a community of practice that benefits users with complex and specialized search requirements. Jumper delivers universal search of any databases, flat files, fileshares, content systems, web pages, blogs and wikis, even people - through one simple search box.

jstorm - Enterprise Stream Process Engine


Alibaba JStorm is an enterprise fast and stable streaming process engine. It runs program up to 4x faster than Apache Storm. It is easy to switch from record mode to mini-batch mode. It is not only a streaming process engine. It means one solution for real time requirement, whole realtime ecosystem.

snappydata - SnappyData: OLTP + OLAP Database built on Apache Spark


SnappyData is a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) in a single integrated cluster. We realize this platform through a seamless integration of Apache Spark (as a big data computational engine) with GemFire XD (as an in-memory transactional store with scale-out SQL semantics).

Searchdaimon - Enterprise Search


Searchdaimon is an open source search engine for corporate data and websites. It comes with a powerful administrator interface and can index websites and several common enterprise systems like SharePoint, Exchange, SQL databases, Windows file shares etc. It also supports many data sources (e.g., Word, PDF, Excel) and the possibility of faceted search, attribute navigation and collection sorting.