UIMA - Unstructured information management architecture

  •        2004

UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. It is a framework with different set of components. The components include Language Identification, Language specific segmentation, Sentence boundary detection, Entity detection (person/place names) etc. The framework manages these components and the data flows between them. It could detects entities like email addresses, URLs, phone numbers, zip codes or any other entity based on regular expressions.


http://uima.apache.org/

Tags
Implementation
License
Platform

   




Related Projects

SMILA - Unified information access architecture


SMILA is an extensible framework for building search solutions to access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, like connectors to most relevant data sources. Using the framework as their basis will enable developers to concentrate on the creation of higher value solutions, like semantic driven applications etc.

OpenPipe - Document Pipeline


OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps / operations performed on a document to convert from its raw form to something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction or submitting the document to a search engine.

Aperture - Java framework for getting data and metadata


Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It could crawl and extract information from File system, Websites, Mail boxes and Mail servers. It supports various file formats like Office, PDF, Zip and lot more. Metadata information is extracted from image files. Aperture has a strong focus on semantics, metadata extracted could be mapped to predefined properties.

Gate - General Architecture for Text Engineering


GATE excels at text analysis of all shapes and sizes. It provides support for diverse language processing tasks such as parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. It provides support to measure, evaluate, model and persist the data structure. It could analyze text or speech. It has built-in support for machine learning and also adds support for different implementation of machine learning via plugin.

Constellio - Enterprise Search engine


Constellio Open Source Enterprise Search is based on Apache Solr and using Google Search Appliances connectors architecture, it allows, with a single click, to find all relevant content in your organization (Web, email, ECM, CRM etc.).


Tikka - A content analysis toolkit


Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

cider - "Content Integration Framework: Document Extraction and Retrieval" - A document parser framework that stores parsed entities into jena ( http://jena


"Content Integration Framework: Document Extraction and Retrieval" - A document parser framework that stores parsed entities into jena ( http://jena.sourceforge.net/ ) RDF vocabularies and provides knowledge-base enhanced semantic ananlysis of content. Annotated content can be used by search engines to present content navigation which will be implemented in the YaCy Search Engine

documents4j - Java library for converting documents into another document format


documents4j is a Java library for converting documents into another document format. This is achieved by delegating the conversion to any native application which understands the conversion of the given file into the desired target format.

Apache POI - Java API To Access Microsoft Document File Formats


APIs for manipulating various file formats based upon Open Office XML (ECMA-376) and Microsoft's OLE 2 Compound Document formats using pure Java. Apache POI is your Java Excel, Word and PowerPoint solution. We have a complete API for porting other OOXML and OLE 2 Compound Document formats and welcome others to participate.

docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files


docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

Hydra - Distributed processing framework for search solutions


Hydra is designed to give the search solution the tools necessary to modify the data that is to be indexed in an efficient and flexible way. This is done by providing a scalable and efficient pipeline which the documents will have to pass through before being indexed into the search engine. Architecturally Hydra sits in between the search engine and the source integration.

Supplier Document Connector


Supplier Document Connector allows automotive suppliers to exchange necessary EDI documents with a major automotive partner/client. It uses HTTPS to securely connect directly to Daimler-Chrysler's EBMX, avoiding a third party network or service provider.

TCPDF - PHP class for generating PDF


TCPDF is a PHP class for generating PDF documents without requiring external extensions. TCPDF Supports UTF-8, Unicode, RTL languages, XHTML, Javascript, digital signatures, barcodes and much more.

TWiki - Wiki and Web 2.0 Application Platform


TWiki is a flexible, powerful, and easy to use enterprise wiki, enterprise collaboration platform, and web application platform. It is a Structured Wiki, typically used to run a project development space, a document management system, a knowledge base, or any other groupware tool, on an intranet, extranet or the Internet. TWiki is a cgi-bin script written in Perl. It reads a text file, hyperlinks it and converts it to HTML on the fly.

Grep's Template Library Connector


Adds the option to use a SharePoint document library as a template library for other document libraries. The templates are hierarchically shown in the "New" menu submenus of the document libraries, and instantiates exactly like common document templates.

PDFBox - Java PDF library


Apache PDFBox is an open source Java PDF library for working with PDF documents. This library allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. It provides support for adding bookmarks, fonts, text extraction, Encryption, PDF printing and lot more.

jPod - PDF manipulating and rendering framework


jPod is a PDF manipulating and rendering framework. It provides functionality to read, verify the document against the PDF specification. It also provides content stream and rendering framework. It could able to create new document and do incremental updates.

Ghostscript - Document Rendering and Conversion


Ghostscript is a rendering and conversion engine for page description languages, including Postscript and PDF. It has ability to convert PostScript language files to many raster formats, view them on displays, and print them on printers that don't have PostScript language capability built in.

ManifoldCF - Framework for connecting Source Content Repositories


ManifoldCF is an effort to provide an open source framework for connecting source content repositories like Microsoft Sharepoint, EMC Documentum FileNet, LiveLink (OpenText), Patriarch, Meridio (Autonomy), Windows shares to target repositories or indexes such as Apache Solr, QBase (formerly MetaCarta). It could also retrieve content from file system, JDBC connector, RSS crawler, and web crawler.

Phoenix Information Extraction


Phoenix is an information extraction engine written in java. Controlled by rules (declared in xml), it extracts information form any XML document (unstructured XHTML/OpenOffice documents). Supports XPath, additional conditions and top-down decomposit