UIMA - Unstructured information management architecture
UIMA analyzes large volumes of unstructured information to discover knowledge that is relevant to an end user. It is a framework built from a set of components, including language identification, language-specific segmentation, sentence boundary detection and entity detection (person and place names). The framework manages these components and the data flow between them. It can also detect entities such as email addresses, URLs, phone numbers, zip codes or any other entity that can be described by a regular expression.
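To illustrate what regular-expression-based entity detection looks like in general, here is a minimal sketch in plain Python. It does not use the UIMA API. The patterns are deliberately simplified assumptions (US-style phone numbers and zip codes only), and the names `PATTERNS` and `detect_entities` are hypothetical.

```python
import re

# Simplified illustrative patterns. These are assumptions made for this
# sketch, not the rules shipped with UIMA or any of its annotators.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"]+"),
    "us_phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def detect_entities(text):
    """Return (type, matched_text, start, end) tuples sorted by start offset."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda entity: entity[2])


if __name__ == "__main__":
    sample = "Mail bob@example.com or call 555-123-4567. Docs: http://example.org/docs ZIP 90210"
    for entity in detect_entities(sample):
        print(entity)
```

Each result carries character offsets, which is roughly how annotation-based frameworks record where an entity was found in the source text. Real-world patterns for these entity types need to be considerably more careful than the ones shown here.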

Related Products
SMILA - Unified information access architecture
SMILA is an extensible framework for building search solutions to access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, such as connectors to the most relevant data sources. Using the framework as a basis lets developers concentrate on creating higher-value solutions, such as semantics-driven applications.
Aperture - Java framework for getting data and metadata
Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes and mail servers. It supports many file formats, including Office, PDF and Zip. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.
OpenPipe - Document Pipeline
OpenPipe is an open source, scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps (operations) performed on a document to convert it from its raw form into something ready to be put into the index.
The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction and submitting the document to a search engine.
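The idea of a pipeline as an ordered set of steps applied to each document can be sketched in a few lines of plain Python. This is not the OpenPipe API. The step functions, the dictionary-based document and the in-memory `index` list are all hypothetical stand-ins chosen for illustration.

```python
def strip_whitespace(doc):
    """Field manipulation step: collapse runs of whitespace in the body."""
    doc["body"] = " ".join(doc["body"].split())
    return doc


def add_token_count(doc):
    """Enrichment step: record how many whitespace-separated tokens the body has."""
    doc["token_count"] = len(doc["body"].split())
    return doc


def make_index_step(index):
    """Build a final step that submits a copy of the document to `index`.

    `index` is a plain list standing in for a search engine (an assumption).
    """
    def submit(doc):
        index.append(dict(doc))
        return doc
    return submit


def run_pipeline(steps, docs):
    """Apply each step in order to every document; a step returning None drops it."""
    for doc in docs:
        for step in steps:
            doc = step(doc)
            if doc is None:
                break
        else:
            yield doc


if __name__ == "__main__":
    index = []
    steps = [strip_whitespace, add_token_count, make_index_step(index)]
    for processed in run_pipeline(steps, [{"body": "  raw   text \n here "}]):
        print(processed)
```

Because every step takes a document and returns a document, steps can be reordered or swapped without touching the runner, which is the main attraction of the pipeline design.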
GATE - General Architecture for Text Engineering
GATE excels at text analysis of all shapes and sizes. It supports diverse language processing tasks and tools such as parsers, morphology, tagging, information retrieval tools, information extraction components for various languages, and many others. It also provides support to measure, evaluate, model and persist the data structures involved. It can analyze text or speech. It has built-in support for machine learning and supports additional machine learning implementations via plugins.
ManifoldCF - Framework for connecting Source Content Repositories
ManifoldCF is an effort to provide an open source framework for connecting source content repositories such as Microsoft SharePoint, EMC Documentum, FileNet, LiveLink (OpenText), Patriarch, Meridio (Autonomy) and Windows shares to target repositories or indexes such as Apache Solr and QBase (formerly MetaCarta). It can also retrieve content through file system, JDBC, RSS crawler and web crawler connectors.
Google-refine - Tool for working with messy data
Google Refine is a power tool for working with messy data: cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase. Google Refine is a web application, but it is run on one's own machine and used by oneself. Its reconciliation support helps to link text names in your data to database identifiers.
PDFBox - Java PDF library
Apache PDFBox is an open source Java library for working with PDF documents. It allows creation of new PDF documents, manipulation of existing documents and extraction of content from documents. It supports bookmarks, fonts, text extraction, encryption, PDF printing and more.
PDFClown - PDF library
PDFClown is a PDF library that helps to generate, read and edit PDF documents. It can also split and merge PDF documents. It supports images, fonts, barcodes, bookmarks, annotations, form fields (checkbox, button, list box, etc.), compression and text extraction.
Behemoth - Large Scale Document Processing based on Apache Hadoop
Behemoth is an open source platform for large scale document processing based on Apache Hadoop. It consists of a simple annotation-based implementation of a document and a number of modules operating on these documents. One of the main aspects of Behemoth is to simplify the deployment of document analysers on a large scale.
Constellio - Enterprise Search engine
Constellio Open Source Enterprise Search is based on Apache Solr and uses the Google Search Appliance connector architecture. It lets you find, with a single click, all relevant content in your organization (web, email, ECM, CRM, etc.).