OpenPipe - Document Pipeline

OpenPipe is an open-source, scalable platform for processing a stream of documents. A pipeline is an ordered set of steps (operations) applied to a document to convert it from its raw form into something ready to be put into the index. Typical operations include language detection, field manipulation, POS tagging, entity extraction, and submitting the document to a search engine.

OpenPipe can extract content from databases and file systems, and can pull content or metadata from a wide range of file formats.
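The pipeline idea is easy to sketch in code. A minimal illustration in plain Python (not OpenPipe's actual API; the step functions and field names are made up for the example):

# Conceptual sketch of a document pipeline: each step takes a
# document (here a dict of fields) and returns the modified document.
def detect_language(doc):
    # Stand-in for real language detection.
    doc["language"] = "en" if " the " in doc["text"] else "unknown"
    return doc

def normalize_title(doc):
    # Field manipulation: clean up a single field.
    doc["title"] = doc.get("title", "").strip().title()
    return doc

def run_pipeline(doc, steps):
    # Apply the ordered steps in sequence; afterwards the document
    # is ready to be handed to the indexer.
    for step in steps:
        doc = step(doc)
    return doc

doc = {"title": "  hello world ", "text": "All the news."}
print(run_pipeline(doc, [detect_language, normalize_title]))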

http://openpipe.berlios.de/


Related Projects

GATE - General Architecture for Text Engineering


GATE excels at text analysis of all shapes and sizes. It provides components for diverse language processing tasks: parsing, morphological analysis, tagging, information retrieval tools, and information extraction for many languages, among others. It also includes facilities to measure, evaluate, model, and persist the resulting data structures, and it can analyze text or speech. Machine learning support is built in, and further machine learning implementations can be added via plugins.

Aperture - Java framework for getting data and metadata


Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports many file formats, such as Office documents, PDF, and ZIP archives, and it extracts metadata from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

SMILA - Unified information access architecture


SMILA is an extensible framework for building search solutions that access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, such as connectors to the most relevant data sources. Using the framework as their basis lets developers concentrate on creating higher-value solutions, such as semantically driven applications.

UIMA - Unstructured information management architecture


UIMA analyzes large volumes of unstructured information in order to discover knowledge that is relevant to an end user. It is a framework built from pluggable components, including language identification, language-specific segmentation, sentence boundary detection, and entity detection (person and place names). The framework manages these components and the data flow between them.
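As a rough illustration of this component model (a Python sketch, not UIMA's actual Java API), each component reads a shared document and contributes stand-off annotations to it:

# Sketch of UIMA-style analysis: components share one document
# object and add stand-off annotations (type, start, end).
class Document:
    def __init__(self, text):
        self.text = text
        self.annotations = []

def sentence_detector(doc):
    # Naive sentence boundary detection on '.' only.
    start = 0
    for i, ch in enumerate(doc.text):
        if ch == ".":
            doc.annotations.append(("Sentence", start, i + 1))
            start = i + 2  # skip the following space

def entity_detector(doc):
    # Toy entity detection: long capitalized words become spans.
    for word in doc.text.replace(".", "").split():
        if word.istitle() and len(word) > 3:
            start = doc.text.find(word)
            doc.annotations.append(("Entity", start, start + len(word)))

doc = Document("Alice met Bob in Paris. They talked.")
for component in (sentence_detector, entity_detector):
    component(doc)
print(doc.annotations)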

Hydra - Distributed processing framework for search solutions


Hydra is designed to give a search solution the tools necessary to modify the data to be indexed in an efficient and flexible way. It does this by providing a scalable and efficient pipeline through which documents must pass before being indexed into the search engine. Architecturally, Hydra sits between the search engine and the source integration.

tif - Text Interchange Formats


This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, a document-term matrix, and tokenized text are included. The tokenized text format is extensible to other annotations. There are two versions of the corpus and tokens objects; packages should accept both and return or coerce to at least one of them. A valid corpus data frame object is a data frame with at least two columns: the first is called doc_id and is a character vector with UTF-8 encoding (document ids must be unique); the second is called text and must also be a UTF-8 character vector. Each individual document is represented by a single row in the data frame. Additional document-level metadata columns and corpus-level attributes are allowed but not required.
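The corpus contract above is concrete enough to check mechanically. A sketch of the same rules in Python with pandas (the tif package itself is R; this merely mirrors the spec for illustration):

import pandas as pd

def is_valid_corpus(df):
    # Mirrors the tif corpus rules: first column doc_id (unique
    # strings), second column text (strings); extra document-level
    # metadata columns are allowed.
    if list(df.columns[:2]) != ["doc_id", "text"]:
        return False
    if not all(isinstance(v, str) for v in df["doc_id"]):
        return False
    if not all(isinstance(v, str) for v in df["text"]):
        return False
    return df["doc_id"].is_unique

corpus = pd.DataFrame({
    "doc_id": ["d1", "d2"],
    "text": ["First document.", "Second document."],
})
print(is_valid_corpus(corpus))  # True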

Tika - A content analysis toolkit


Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
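For instance, with the third-party tika-python binding (an assumption: pip install tika; the file name is illustrative), extraction is a couple of lines:

# tika-python starts and talks to a local Tika server behind the
# scenes; from_file returns extracted metadata and plain text.
from tika import parser

parsed = parser.from_file("sample.pdf")
print(parsed["metadata"].get("Content-Type"))
print((parsed["content"] or "")[:200])  # first 200 chars of text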

nlp - Natural language processing tools for text generation, search and analysis.


Natural language processing tools for text generation, search and analysis.

Constellio - Enterprise Search engine


Constellio Open Source Enterprise Search is based on Apache Solr and uses the Google Search Appliance connector architecture. With a single click, it finds all relevant content in your organization (web, email, ECM, CRM, etc.).

TextTeaser - Automatic Summarization Algorithm


TextTeaser is an automatic summarization algorithm that combines natural language processing and machine learning to produce good results. It can provide the gist of an article or better previews in news readers.
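TextTeaser's exact feature set is its own, but the general extractive approach can be sketched with simple word-frequency scoring (illustrative Python, not TextTeaser's algorithm):

import re
from collections import Counter

def summarize(text, n=2):
    # Score each sentence by the frequency of its words in the
    # whole text, then keep the top-n sentences in original order.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)

print(summarize("Cats sleep a lot. Cats also play. Dogs bark loudly."))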

japanese-nlptools - Tools for NLP-related analysis of Japanese text


Tools for NLP-related analysis of Japanese text

ArabicNLP - Collection of various Arabic NLP and Text Processing Scripts and Utilities


Collection of various Arabic NLP and Text Processing Scripts and Utilities

bogofilter - Fast Bayesian Spam Filter


Bogofilter is a mail filter that classifies mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections. Bogofilter provides processing for plain text and HTML. It supports multi-part MIME messages with decoding of base64, quoted-printable, and uuencoded text and ignores attachments, such as images.
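The underlying technique is classic naive Bayes; a compact sketch of the idea (generic Python, not bogofilter's implementation):

import math
from collections import Counter

class BayesFilter:
    # Minimal Bayesian text classifier in the spirit of bogofilter:
    # learn from labeled spam/ham, score new text by word odds.
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, text):
        words = text.lower().split()
        self.counts[label].update(words)
        self.totals[label] += len(words)

    def is_spam(self, text):
        score = 0.0
        for word in text.lower().split():
            # Laplace-smoothed log-likelihood ratio per word.
            p_spam = (self.counts["spam"][word] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.totals["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score > 0

f = BayesFilter()
f.train("spam", "buy cheap pills now")
f.train("ham", "meeting notes attached for review")
print(f.is_spam("cheap pills available now"))  # True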

doorstop - Requirements management using version control.


Talks: [GRDevDay](https://speakerdeck.com/jacebrowning/doorstop-requirements-management-using-python-and-version-control), [BarCamp](https://speakerdeck.com/jacebrowning/strip-searched-a-rough-introduction-to-requirements-management). Sample: [Generated HTML](http://doorstop.info/reqs/index.html). Documentation: [API](http://doorstop.info/docs/index.html), [Demo](http://nbviewer.ipython.org/gist/jacebrowning/9754157). Getting started requires Python 3.3+ and a version control system.

NCrawler


A simple and very efficient multithreaded web crawler with pipeline-based processing, written in C#. It contains HTML, text, PDF, and IFilter document processors and language detection (Google). Pipeline steps to extract, use, and alter information are easy to add.

interest-graph - Interest Graph


This service automatically analyzes the content of a document or piece of text and reports the interests present in it. An interest is a non-hierarchical, single-phrase summary of the thematic content of a piece of text; examples include Functional Programming, Celebrity Gossip, or Flowers. At Prismatic, we've been using interests to automatically analyze the content of text in order to help connect people with the content they find interesting.

Automatic-Text-Summarizer


Automatic Document Summarizer using Bipartite HITS, Natural Language Processing (NLP)

html5-from-with-anti-spam-and-placeholder


An HTML5 contact-form snippet with anti-spam measures and placeholder attributes; the repository contains the raw HTML source rather than a library.

content-structure - Code from "Incorporating Content Structure into Text Analysis Applications"


Code from "Incorporating Content Structure into Text Analysis Applications"