Hydra - Distributed processing framework for search solutions

  •        0

Hydra is designed to give the search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. This is done by providing a scalable and efficient pipeline which the documents will have to pass through before being indexed into the search engine. Architecturally Hydra sits in between the search engine and the source integration.

A common use-case would be to use Apache ManifoldCF to crawl a filesystem and send the documents to Hydra which in turn will process and dispatch processed documents to Solr for indexing.




Related Projects

tif - Text Interchange Formats

This package describes and validates formats for storing common object arising in text analysis as native R objects. Representations of a text corpus, document term matrix, and tokenized text are included. The tokenized text format is extensible to include other annotations. There are two versions of the corpus and tokens objects; packages should accept both and return or coerce to at least one of these.corpus (data frame) - A valid corpus data frame object is a data frame with at least two columns. The first column is called doc_id and is a character vector with UTF-8 encoding. Document ids must be unique. The second column is called text and must also be a character vector in UTF-8 encoding. Each individual document is represented by a single row in the data frame. Addition document-level metadata columns and corpus level attributes are allowed but not required.

Tikka - A content analysis toolkit

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

OpenPipe - Document Pipeline

OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps / operations performed on a document to convert from its raw form to something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction or submitting the document to a search engine.

LibreOffice - The Document foundation

LibreOffice is the free power-packed Open Source personal productivity suite for Windows, Macintosh and Linux. LibreOffice is the perfect choice for home users, businesses, government and other organizations. It's native file format is the ISO standardized ODF (Open Document Format), but LibreOffice can open and save Microsoft Word, PowerPoint and Excel files, as well as many other formats, bringing you the widest-available compatibility with other products.

doorstop - Requirements management using version control.

- talks: [GRDevDay](https://speakerdeck.com/jacebrowning/doorstop-requirements-management-using-python-and-version-control), [BarCamp](https://speakerdeck.com/jacebrowning/strip-searched-a-rough-introduction-to-requirements-management)- sample: [Generated HTML](http://doorstop.info/reqs/index.html)- documentation: [API](http://doorstop.info/docs/index.html), [Demo](http://nbviewer.ipython.org/gist/jacebrowning/9754157)Getting Started===============Requirements------------* Python 3.3+* A version

layout-analysis - Document layout analysis tools

Document layout analysis tools

JODConverter - Automates document conversions using OpenOffice

JODConverter automates conversions between office document formats using OpenOffice.org or LibreOffice. Supported formats include OpenDocument, PDF, RTF, HTML, Word, Excel, PowerPoint, and Flash. It can be used as a Java library, a command line tool, or a web application.

Ghostscript - Document Rendering and Conversion

Ghostscript is a rendering and conversion engine for page description languages, including Postscript and PDF. It has ability to convert PostScript language files to many raster formats, view them on displays, and print them on printers that don't have PostScript language capability built in.

Bagri - XML/Document DB on top of distributed cache

Bagri is a Document Database built on top of distributed cache solution like Hazelcast or Coherence. The system allows to process semi-structured schema-less documents and perform distributed queries on them in real-time. It scales horizontally very well with use of data sharding, when all documents are distributed evenly between distributed cache partitions.

Behemoth - Large Scale Document Processing based on Apache Hadoop

Behemoth is an open source platform for large scale document processing based on Apache Hadoop. It consists of a simple annotation-based implementation of a document and a number of modules operating on these documents. One of the main aspects of Behemoth is to simplify the deployment of document analysers on a large scale.


Basic Ribbon based document viewer for Windows XP, Vista and Windows 7. Supports opening FlowDocument, Html, Rich text and Text document formats.


The Vodoo/Stream project let users to define transducers dedicated to document analysis. Such transducers describe how fragments are matched and transformed. Finally a document can be an XML fragment, a free text or something else depending on extensions

teaching-dca - Course in Document and Content Analysis.

Course in Document and Content Analysis.

FastFish.Analysis - This is a Analysis source code document

This is a Analysis source code document

Infinit.e - The first Open Source document analysis platform

The first Open Source document analysis platform

underverse - A JSON-based document storage and analysis module for Python

A JSON-based document storage and analysis module for Python

thunderbolt - Graph-based document analysis system

Graph-based document analysis system

SharePoint 2010 Folder To Document Set Conversion Fix

Folder To Document Set Conversion Fix repairs the MS SharePoint 2010 bug where folders are not correctly converted into document sets during a content type property change. This will fix existing problem doc sets on the next update to the doc set properties. This includes a ...

Pandoc - General Markup Converter

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It an convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to HTML formats, Word processor formats, PDF and other markup formats.