Hydra - Distributed processing framework for search solutions

  •    Java

Hydra gives a search solution the tools to modify data efficiently and flexibly before it is indexed. It does this by providing a scalable pipeline that every document passes through before being indexed into the search engine. Architecturally, Hydra sits between the source integration and the search engine.
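The pipeline idea described above can be illustrated with a minimal sketch. This is not Hydra's API: the class `PipelineSketch`, the `Pipeline` type, and the two example stages are hypothetical names chosen for illustration, and the sketch runs in a single process, whereas Hydra is a distributed framework. It only shows documents passing through an ordered chain of stages before they would be handed to an indexer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from Hydra itself.
class PipelineSketch {

    // A document is modeled as an ordered map of field name to value.
    static Map<String, Object> newDocument(String id, String body) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("body", body);
        return doc;
    }

    // An ordered chain of stages; each stage receives the output of the previous one.
    static final class Pipeline {
        private final List<UnaryOperator<Map<String, Object>>> stages = new ArrayList<>();

        Pipeline addStage(UnaryOperator<Map<String, Object>> stage) {
            stages.add(stage);
            return this;
        }

        Map<String, Object> process(Map<String, Object> doc) {
            Map<String, Object> current = doc;
            for (UnaryOperator<Map<String, Object>> stage : stages) {
                current = stage.apply(current);
            }
            return current;
        }
    }

    // Two illustrative stages: trim the body, then record its length as a new field.
    static Pipeline defaultPipeline() {
        return new Pipeline()
            .addStage(doc -> {
                doc.put("body", ((String) doc.get("body")).trim());
                return doc;
            })
            .addStage(doc -> {
                doc.put("length", ((String) doc.get("body")).length());
                return doc;
            });
    }

    public static void main(String[] args) {
        Map<String, Object> doc = newDocument("doc-1", "  Hello search  ");
        // In a real deployment the processed document would be sent on to the search engine.
        System.out.println(defaultPipeline().process(doc));
    }
}
```

Keeping each stage as a small function from document to document means stages can be added, removed, or reordered without touching the source integration or the indexer.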

Apache Tika - A content analysis toolkit

  •    Java

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
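To illustrate what "a single interface" for many file types means, here is a minimal sketch using only the Java standard library. It does not use or reproduce Tika's API: `SingleInterfaceSketch`, `TextExtractor`, `detectType`, and the two toy extractors are hypothetical, and type detection here naively looks at the file extension, which is an assumption made to keep the sketch short.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not Tika classes.
class SingleInterfaceSketch {

    // Every format-specific extractor implements the same one-method contract.
    interface TextExtractor {
        String extract(byte[] content);
    }

    // Registry mapping a detected type to the extractor that handles it.
    static final Map<String, TextExtractor> EXTRACTORS = new LinkedHashMap<>();
    static {
        EXTRACTORS.put("txt", content -> new String(content, StandardCharsets.UTF_8));
        EXTRACTORS.put("csv", content ->
            new String(content, StandardCharsets.UTF_8).replace(',', ' '));
    }

    // Naive detection by file extension (an assumption of this sketch).
    static String detectType(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // The single entry point: callers never choose a format-specific extractor themselves.
    static String parseToString(String fileName, byte[] content) {
        TextExtractor extractor = EXTRACTORS.get(detectType(fileName));
        if (extractor == null) {
            throw new IllegalArgumentException("Unsupported type: " + fileName);
        }
        return extractor.extract(content);
    }

    public static void main(String[] args) {
        byte[] data = "title,author".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseToString("books.csv", data));
    }
}
```

Because callers depend only on `parseToString`, supporting a new format means registering one more extractor rather than changing the indexing code.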

robin - RObust document image BINarization

  •    Python

robin is a RObust document image BINarization tool, written in Python. robin requires Python v3.5+ to run.