Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It extracts text from the following file formats:

<ul>
<li>HyperText Markup Language</li>
<li>XML and derived formats</li>
<li>Microsoft Office document formats</li>
<li>OpenDocument Format</li>
<li>Portable Document Format</li>
<li>Electronic Publication Format</li>
<li>Rich Text Format</li>
<li>Compression and packaging formats</li>
<li>Text formats</li>
<li>Audio formats</li>
<li>Image formats</li>
<li>Video formats</li>
<li>Java class files and archives</li>
<li>The mbox format</li>
</ul>
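As a rough illustration of text and metadata extraction, the sketch below builds command lines for the Tika application jar from Python. The jar path `tika-app.jar`, the helper names `tika_command` and `run_tika`, and the sample file names are assumptions made for illustration; `--text`, `--metadata`, and `--detect` are options of the Tika command-line application.

```python
import shutil
import subprocess

# Hypothetical location of the Tika application jar; adjust to your download.
TIKA_JAR = "tika-app.jar"

# Modes mapped to options of the Tika command-line application.
TIKA_MODES = {"text": "--text", "metadata": "--metadata", "detect": "--detect"}


def tika_command(path, mode="text", jar=TIKA_JAR):
    """Build (but do not run) a Tika CLI invocation for one document."""
    if mode not in TIKA_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["java", "-jar", jar, TIKA_MODES[mode], path]


def run_tika(path, mode="text", jar=TIKA_JAR):
    """Run Tika and return its standard output, or None when Java is missing."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(
        tika_command(path, mode, jar),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `run_tika("report.pdf")` would return the plain text of the document, provided Java and the jar are available; `run_tika("report.pdf", mode="metadata")` would return its metadata instead.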

Tika - A content analysis toolkit

Ghostscript is a rendering and conversion engine for page description languages, including PostScript and PDF. It can convert PostScript language files to many raster formats, display them on screen, and print them on printers that lack built-in PostScript support.
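The conversion to raster formats can be sketched as a command line built from Python. This is a minimal sketch, assuming the Ghostscript executable is named `gs` (on Windows it is typically `gswin64c`); the helper name `ghostscript_png_command`, the 150 dpi default, and the file names are illustrative assumptions. `png16m` is one of Ghostscript's raster output devices.

```python
import shutil
import subprocess


def ghostscript_png_command(source, output_pattern="page-%03d.png", dpi=150,
                            executable="gs"):
    """Build a Ghostscript command that renders each page to a PNG file."""
    return [
        executable,
        "-dSAFER",                          # restrict file access while rendering
        "-dBATCH",                          # exit after the last input file
        "-dNOPAUSE",                        # do not wait for a key press per page
        "-sDEVICE=png16m",                  # 24-bit colour PNG output device
        f"-r{dpi}",                         # rendering resolution in dpi
        f"-sOutputFile={output_pattern}",   # %03d is replaced by the page number
        source,
    ]


def render_to_png(source, **options):
    """Run the conversion; return False when Ghostscript is not installed."""
    command = ghostscript_png_command(source, **options)
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True
```

The same command shape accepts a PostScript or a PDF file as `source`; only the device and output pattern need to change for other raster formats.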

Ghostscript - Document Rendering and Conversion

Apache POI provides pure Java APIs for manipulating file formats based on the Office Open XML standard (ECMA-376) and Microsoft's OLE 2 Compound Document format. Apache POI is your Java Excel, Word and PowerPoint solution. The project offers a complete API for porting other OOXML and OLE 2 Compound Document formats and welcomes others to participate.

Apache POI - Java API To Access Microsoft Document File Formats

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.
 
Pandoc can write plain text, Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, reStructuredText, XHTML, HTML5, LaTeX (including beamer slide shows), ConTeXt, RTF, OPML, DocBook, JATS, OpenDocument, ODT, Word docx, GNU Texinfo, MediaWiki markup, DokuWiki markup, ZimWiki markup, Haddock markup, EPUB (v2 or v3), FictionBook2, Textile, groff man, groff ms, Emacs Org mode, AsciiDoc, InDesign ICML, TEI Simple, Muse, PowerPoint slide shows and Slidy, Slideous, DZSlides, reveal.js or S5 HTML slide shows. It can also produce PDF output on systems where LaTeX, ConTeXt, pdfroff, wkhtmltopdf, prince, or weasyprint is installed.
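A conversion between two of the formats listed above can be sketched from Python as follows. The helper names `pandoc_command` and `convert` and the file names are assumptions for illustration; `-o`, `-f`, and `-t` are pandoc's options for the output file, input format, and output format. When the formats are omitted, pandoc infers them from the file extensions.

```python
import shutil
import subprocess


def pandoc_command(source, target, from_format=None, to_format=None):
    """Build a pandoc command line converting source into target."""
    command = ["pandoc", source, "-o", target]
    if from_format:
        command += ["-f", from_format]   # explicit input format, e.g. "markdown"
    if to_format:
        command += ["-t", to_format]     # explicit output format, e.g. "html5"
    return command


def convert(source, target, **formats):
    """Run the conversion; return False when pandoc is not installed."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_command(source, target, **formats), check=True)
    return True
```

For example, `convert("notes.md", "notes.docx")` would produce a Word document from a Markdown file on a system where pandoc is installed.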

Pandoc - Universal markup converter

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy - Web crawling & scraping framework for Python

Apache PDFBox is an open source Java library for working with PDF documents. It allows you to create new PDF documents, manipulate existing documents, and extract content from documents. It supports bookmarks, fonts, text extraction, encryption, PDF printing, and more. It also has <A HREF="http://pdfbox.apache.org/userguide/dot_net.html">.NET support</A>.

Form filling is one of its most important features: it can fill in forms using FDF and XFDF data. Command-line utilities cover most common tasks. For example, the PDFToImage utility creates an image for every page of a PDF document.

PDF documents can be split into multiple documents, and multiple PDF documents can be merged into one. Integration with the Lucene search engine enables full-text search.

PDFBox - Java PDF library

iText is a popular and widely used PDF library for generating PDF documents dynamically. It suits web developers who need to generate PDF documents and reports from data in an XML file or a database and serve them to the browser. It supports bookmarks, watermarks, encryption, form filling, and more.

iText can convert PDF to text, generate PDF from XML and HTML, and digitally sign documents.

iText - Java PDF library

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.
 
It supports: 
<ul>
<li>Open existing docx/pptx/xlsx</li>
<li>Create new docx/pptx/xlsx</li>
<li>Programmatically manipulate docx/pptx/xlsx (anything the file format allows)</li>
<li>CustomXML binding (with support for pictures, rich text, checkboxes, and OpenDoPE extensions for repeats &amp; conditionals, and importing XHTML)</li>
<li>Export as HTML</li>
<li>Export as PDF (using Plutext's PDF Converter, or use docx4j-export-FO project)</li>
<li>Produce/consume Word 2007's xmlPackage (pkg) format</li>
<li>Apply transforms, including common filters</li>
<li>Font support (font substitution, and use of any fonts embedded in the document)</li>
</ul>

docx4j - JAXB-based Java library for Word docx, PowerPoint pptx, and Excel xlsx files

PDFClown is a PDF library that helps to generate, read, and edit PDF documents. It can split and merge PDF documents, and it supports images, fonts, barcodes, bookmarks, annotations, form fields (check box, button, list box, and so on), compression, and text extraction. A .NET version is developed for Mono.

SourceForge: <A HREF="http://sourceforge.net/projects/clown/" target="_blank">http://sourceforge.net/projects/clown/</A>

PDFClown - PDF library

GATE excels at text analysis of all shapes and sizes. It supports diverse language processing tasks, with components such as parsers, morphology, tagging, information retrieval tools, and information extraction components for various languages. It also helps to measure, evaluate, model, and persist the associated data structures. It can analyze text or speech. It has built-in support for machine learning and supports other machine learning implementations via plugins.
Its family of products includes:
 GATE Developer: An integrated development environment for language processing components bundled with a very widely used Information Extraction system and a comprehensive set of other plugins. 
 
 GATE Teamware: A collaborative annotation environment for factory-style semantic annotation projects built around a workflow engine and a heavily-optimized backend service infrastructure. 

 GATE Embedded: An object library optimized for inclusion in diverse applications giving access to all the services used by GATE Developer and more.
 <img src="/AppImages/Article/gate_img1.jpg" alt="" class="float-center">

GATE - General Architecture for Text Engineering

SMILA is an extensible framework for building search solutions to access unstructured information in the enterprise. Besides providing essential infrastructure components and services, SMILA also delivers ready-to-use add-on components, such as connectors to the most relevant data sources. Using the framework as a basis lets developers concentrate on creating higher-value solutions, such as semantics-driven applications. It has a crawler/agent that pushes the data, which is then processed by various filters. BPEL workflows can be created using its WorkerManager, and the pipeline can be managed by this workflow engine.
 <img src="/AppImages/Article/smila_img1.jpg" alt="" class="float-center">

SMILA - Unified information access architecture

HexaPDF is a pure Ruby library with an accompanying application for working with PDF files.

It supports:
<ul>
<li>Creating new PDF files</li>
<li>Manipulating existing PDF files</li>
<li>Merging multiple PDF files into one</li>
<li>Extracting meta information, text, images and files from PDF files</li>
<li>Securing PDF files by encrypting them</li>
<li>Optimizing PDF files for smaller file size or other criteria</li>
</ul>

HexaPDF - A Versatile PDF Creation and Manipulation Library for Ruby

GOCR is an OCR (Optical Character Recognition) program, developed under the GNU General Public License. It converts scanned images of text back to text files. Joerg Schulenburg started the program, and now leads a team of developers. GOCR can be used with different front-ends, which makes it very easy to port to different OSes and architectures. It can open many different image formats, and its quality has been improving on a daily basis.

GOCR

PDFsharp is the Open Source .NET library that easily creates and processes PDF documents on the fly from any .NET language. The same drawing routines can be used to create PDF documents, draw on the screen, or send output to any printer. Neither Adobe's PDF Library nor Acrobat is required. Its features include:
<ul>
<li>Creates PDF documents on the fly from any .NET language</li>
<li>Easy to understand object model to compose documents</li>
<li>One source code for drawing on a PDF page as well as in a window or on the printer</li>
<li>Modify, merge, and split existing PDF files</li>
<li>Images with transparency (color mask, monochrome mask, alpha mask)</li>
<li>Newly designed from scratch and written entirely in C#</li>
<li>The graphical classes go well with .NET</li>
</ul>

PDFsharp - Create and process PDF in .NET

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

Aperture - Java framework for getting data and metadata
