Hydra - Distributed processing framework for search solutions

  •        0

Hydra is designed to give the search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. This is done by providing a scalable and efficient pipeline which the documents will have to pass through before being indexed into the search engine. Architecturally Hydra sits in between the search engine and the source integration.

A common use-case would be to use Apache ManifoldCF to crawl a filesystem and send the documents to Hydra which in turn will process and dispatch processed documents to Solr for indexing.




comments powered by Disqus

Related Projects

HTK - Speech Recognition Toolkit

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.

Text Analysis Markup System

Text Analysis Markup System (TAMS) is both a system of marking documents for qualitative analysis and a series of tools for mining information based on that syntax.

Ghostscript - Document Rendering and Conversion

Ghostscript is a rendering and conversion engine for page description languages, including Postscript and PDF. It has ability to convert PostScript language files to many raster formats, view them on displays, and print them on printers that don't have PostScript language capability built in.

Semantic Vectors - Creating and Searching Semantic Vector using Lucene

The Semantic Vectors package uses a Random Projection algorithm, a form of automatic semantic analysis. Other methods supported by the package include Latent Semantic Analysis (LSA) and Reflective Random Indexing. Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. This library is used in semantic analysis and text mining.


SCAN (Smart Content Aggregation and Navigation) is a universal semantic content aggregator. It combines search, text analysis, tagging and metadata functions to provide new user experience of desktop navigation and document management.

S-Space - A scalable software library for semantic spaces

The S-Space Package is a collection of algorithms for building Semantic Spaces as well as a highly-scalable library for designing new distributional semantics algorithms. Distributional algorithms process text corpora and represent the semantic for words as high dimensional feature vectors.


OCRopus :- The open source document analysis and OCR system featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

Aperture - Java framework for getting data and metadata

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It could crawl and extract information from File system, Websites, Mail boxes and Mail servers. It supports various file formats like Office, PDF, Zip and lot more. Metadata information is extracted from image files. Aperture has a strong focus on semantics, metadata extracted could be mapped to predefined properties.

OpenOffice - leading open-source office software suite

OpenOffice is the leading open-source office software suite for word processing, spreadsheets, presentations, graphics, databases and more. It is available in many languages and works on all common computers. It stores all your data in an international open standard format and can also read and write files from other common office software packages.

ANTLR - ANother Tool for Language Recognition

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day.