UIMA - Unstructured information management architecture

UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. It is a framework made up of a set of components, including language identification, language-specific segmentation, sentence boundary detection, and entity detection (person and place names). The framework manages these components and the data flows between them. It can also detect entities such as email addresses, URLs, phone numbers, zip codes, or any other entity that can be described by a regular expression.
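
The regular-expression style of entity detection mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not UIMA's API: the pattern table, the entity type names, and the detect_entities function are all assumptions made for this example, and the patterns are deliberately simple.

```python
import re

# Illustrative patterns only; these are assumptions, not rules shipped with UIMA.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def detect_entities(text):
    """Return (entity_type, matched_text, start, end) tuples sorted by start offset."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


if __name__ == "__main__":
    sample = "Mail bob@example.com or visit https://example.org/docs today, ZIP 94105."
    for entity in detect_entities(sample):
        print(entity)
```

Keeping the character offsets alongside each match mirrors the way annotation frameworks usually record where in the text an entity was found; a new entity type would be one more entry in the pattern table.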




Related Projects

SMILA - Unified information access architecture

SMILA is an extensible framework for building search solutions to access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, such as connectors to the most relevant data sources. Using the framework as a basis lets developers concentrate on creating higher-value solutions, such as semantics-driven applications.

OpenPipe - Document Pipeline

OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps (operations) performed on a document to convert it from its raw form into something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction, and submitting the document to a search engine.
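
The notion of a pipeline as an ordered set of steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not OpenPipe's actual API: the step functions, the dictionary-based document, and the InMemoryIndex class are hypothetical names, and the language and entity heuristics are toy stand-ins for real components.

```python
import re


def detect_language(doc):
    # Toy heuristic (an assumption for illustration): look for common English words.
    words = set(re.findall(r"[a-z]+", doc["body"].lower()))
    doc["language"] = "en" if words & {"the", "and", "is"} else "unknown"
    return doc


def normalize_title(doc):
    # Field manipulation: trim the title and collapse runs of whitespace.
    doc["title"] = " ".join(doc.get("title", "").split())
    return doc


def extract_capitalized_words(doc):
    # Naive stand-in for entity extraction: collect every capitalized word.
    doc["entities"] = re.findall(r"\b[A-Z][a-z]+\b", doc["body"])
    return doc


class InMemoryIndex:
    """Hypothetical final step that stands in for submitting to a search engine."""

    def __init__(self):
        self.documents = []

    def __call__(self, doc):
        self.documents.append(doc)
        return doc


def run_pipeline(doc, steps):
    # Apply each step in order; every step receives the previous step's output.
    for step in steps:
        doc = step(doc)
    return doc


if __name__ == "__main__":
    index = InMemoryIndex()
    raw = {"title": "  Quarterly   report ", "body": "The report was written in Oslo and is final."}
    print(run_pipeline(raw, [detect_language, normalize_title, extract_capitalized_words, index]))
```

Because each step takes a document and returns a document, steps can be reordered, removed, or added without touching the others, which is the main appeal of the pipeline design.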

Aperture - Java framework for getting data and metadata

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

GATE - General Architecture for Text Engineering

GATE excels at text analysis of all shapes and sizes. It provides support for diverse language processing tasks such as parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. It also supports measuring, evaluating, modeling, and persisting the associated data structures. It can analyze text or speech. It has built-in support for machine learning and supports other machine learning implementations via plugins.

Constellio - Enterprise Search engine

Constellio Open Source Enterprise Search is based on Apache Solr and uses the Google Search Appliance connector architecture. With a single click, it lets you find all relevant content in your organization (web, email, ECM, CRM, etc.).

Tika - A content analysis toolkit

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

documents4j - Java library for converting documents into another document format

documents4j is a Java library for converting documents into another document format. This is achieved by delegating the conversion to any native application which understands the conversion of the given file into the desired target format.

Apache POI - Java API To Access Microsoft Document File Formats

Apache POI provides pure Java APIs for manipulating various file formats based upon the Office Open XML standards (ECMA-376) and Microsoft's OLE 2 Compound Document format. Apache POI is your Java Excel, Word and PowerPoint solution. We have a complete API for porting other OOXML and OLE 2 Compound Document formats and welcome others to participate.

docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

Hydra - Distributed processing framework for search solutions

Hydra is designed to give the search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. This is done by providing a scalable and efficient pipeline which the documents will have to pass through before being indexed into the search engine. Architecturally Hydra sits in between the search engine and the source integration.

Supplier Document Connector

Supplier Document Connector allows automotive suppliers to exchange necessary EDI documents with a major automotive partner/client. It uses HTTPS to securely connect directly to Daimler-Chrysler's EBMX, avoiding a third party network or service provider.

TCPDF - PHP class for generating PDF

TCPDF is a PHP class for generating PDF documents without requiring external extensions. TCPDF supports UTF-8, Unicode, RTL languages, XHTML, JavaScript, digital signatures, barcodes and much more.

TWiki - Wiki and Web 2.0 Application Platform

TWiki is a flexible, powerful, and easy to use enterprise wiki, enterprise collaboration platform, and web application platform. It is a Structured Wiki, typically used to run a project development space, a document management system, a knowledge base, or any other groupware tool, on an intranet, extranet or the Internet. TWiki is a cgi-bin script written in Perl. It reads a text file, hyperlinks it and converts it to HTML on the fly.
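
The "read text, hyperlink it, convert it to HTML" step can be illustrated with a small Python sketch. TWiki itself is written in Perl and its rendering rules are far richer; the render_topic function, the simplified WikiWord pattern, and the /view/ URL prefix below are assumptions made for this example only.

```python
import html
import re

# Simplified WikiWord pattern (an assumption): two or more capitalized segments, e.g. WebHome.
WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")


def render_topic(text):
    """Escape the raw text, link WikiWords, and wrap each paragraph in <p> tags."""
    rendered = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        escaped = html.escape(paragraph)
        # The /view/ prefix is a hypothetical URL scheme used only for this sketch.
        linked = WIKI_WORD.sub(
            lambda m: f'<a href="/view/{m.group()}">{m.group()}</a>', escaped
        )
        rendered.append(f"<p>{linked}</p>")
    return "\n".join(rendered)


if __name__ == "__main__":
    print(render_topic("See WebHome for <details>.\n\nPlain second paragraph."))
```

Escaping the text before inserting links keeps any markup typed by the user from reaching the browser, while the generated anchor tags are left intact.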

Grep's Template Library Connector

Adds the option to use a SharePoint document library as a template library for other document libraries. The templates are shown hierarchically in the submenus of the "New" menu of the document libraries, and are instantiated exactly like common document templates.

PDFBox - Java PDF library

Apache PDFBox is an open source Java PDF library for working with PDF documents. This library allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. It provides support for bookmarks, fonts, text extraction, encryption, PDF printing, and much more.

jPod - PDF manipulating and rendering framework

jPod is a PDF manipulation and rendering framework. It can read a document and verify it against the PDF specification, and it also provides a content stream and rendering framework. It can create new documents and apply incremental updates.

Ghostscript - Document Rendering and Conversion

Ghostscript is a rendering and conversion engine for page description languages, including PostScript and PDF. It can convert PostScript language files to many raster formats, view them on displays, and print them on printers that don't have PostScript language capability built in.

ManifoldCF - Framework for connecting Source Content Repositories

ManifoldCF is an effort to provide an open source framework for connecting source content repositories such as Microsoft SharePoint, EMC Documentum, FileNet, LiveLink (OpenText), Patriarch, Meridio (Autonomy), and Windows shares to target repositories or indexes such as Apache Solr and QBase (formerly MetaCarta). It can also retrieve content through file system, JDBC, RSS crawler, and web crawler connectors.

Phoenix Information Extraction

Phoenix is an information extraction engine written in Java. Controlled by rules (declared in XML), it extracts information from any XML document (unstructured XHTML/OpenOffice documents). Supports XPath, additional conditions and top-down decomposit

JODConverter - Automates document conversions using OpenOffice

JODConverter automates conversions between office document formats using OpenOffice.org or LibreOffice. Supported formats include OpenDocument, PDF, RTF, HTML, Word, Excel, PowerPoint, and Flash. It can be used as a Java library, a command line tool, or a web application.