docconv - Converts PDF, DOC, DOCX, XML, HTML, RTF, etc. to plain text


A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text. Note for returning users: the Go code path for this package has been moved to code.sajari.com/docconv. Follow the installation instructions to check out a version of the code in the correct place.

https://github.com/sajari/docconv


Related Projects

Pandoc - General Markup Converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to HTML formats, word processor formats, PDF and other markup formats.

Pandoc - Universal markup converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.

PHPWord - A pure PHP library for reading and writing word processing documents

  •    PHP

PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF. PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord is aimed to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers' Documentation.

docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files

  •    Java

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

Virtual Image Printer driver

  •    C

Virtual ImagePrinter is based on the Microsoft universal printer driver. ImagePrinter can print any printable document in your Windows system to one or more BMP, PNG, JPG, TIFF or PDF files. Convert Word to PDF or Word to JPG, and convert DOC, DOCX, PDF, TXT, HTM and RTF files to image formats. Please visit http://code-industry.net for more information.


LaTeX to RTF converter

  •    C

LaTeX to RTF converter that handles equations, figures, and cross-refe

ONLYOFFICE Desktop Editors - An office suite that combines text, spreadsheet and presentation editors allowing to create, view and edit local documents

  •    C

ONLYOFFICE Desktop Editors is a free and open source office suite that comprises editors for text documents, spreadsheets and presentations. It lets you create, view and edit documents of any size and complexity, and easily switch to the online mode for real-time co-editing and collaboration. Features such as reviewing, commenting and chat are available as well. Work with multiple files in a single window thanks to the tab-based user interface.

Jasper Reports

  •    Java

JasperReports is the world's most popular open source reporting engine. It is entirely written in Java and it is able to use data coming from any kind of data source and produce pixel-perfect documents that can be viewed, printed or exported in a variety of document formats including HTML, PDF, Excel, OpenOffice and Word.

SharePoint Document Converter

  •    

The SharePoint Document Converter solution gives a starting point for using the Word Automation Service to convert documents to the formats that Word supports. This project converts documents of type "docx" or "doc" to any file type that Word supports, such as PDF, XPS, D...

textract - node

  •    HTML

A text extraction node module. In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
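The MIME-type dispatch described above can be illustrated with a minimal Go sketch. This is not textract's code (textract is a Node module); the tables `mimeByExt` and `extractorByMime`, the extractor names, and the function `extractorFor` are all hypothetical and exist only to show why extensions that share a MIME type end up at the same extractor.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeByExt is a small hypothetical lookup table (an assumption for
// illustration, not textract's real table). Several extensions can
// share one MIME type.
var mimeByExt = map[string]string{
	".html": "text/html",
	".htm":  "text/html",
	".xls":  "application/vnd.ms-excel",
	".xlt":  "application/vnd.ms-excel",
	".txt":  "text/plain",
}

// extractorByMime maps a MIME type to a placeholder extractor name.
// The names are hypothetical.
var extractorByMime = map[string]string{
	"text/html":                "html",
	"application/vnd.ms-excel": "spreadsheet",
	"text/plain":               "plain",
}

// extractorFor picks an extractor from the file's MIME type, not from
// its extension directly. The second result is false when either the
// extension or the MIME type is unknown.
func extractorFor(name string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(name))
	mimeType, ok := mimeByExt[ext]
	if !ok {
		return "", false
	}
	extractor, ok := extractorByMime[mimeType]
	return extractor, ok
}

func main() {
	for _, name := range []string{"index.html", "INDEX.HTM", "book.xls", "notes.md"} {
		if extractor, ok := extractorFor(name); ok {
			fmt.Printf("%s -> %s\n", name, extractor)
		} else {
			fmt.Printf("%s -> unsupported\n", name)
		}
	}
}
```

Because the lookup goes through the MIME type, supporting a new extension in this sketch only requires one new entry in `mimeByExt`, provided an extractor for that MIME type already exists.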

XSL-FO Wysiwyg MiniScribus

  •    C++

XSL-FO formatting markup WYSIWYG editor & PDF tree bookmark. XSL-FO is an XML document format most often used to generate PDF or RTF. It can read and edit 95% of the Apache FOP samples. Export to fo, pdf, rtf, tif fax, page; import fo, html, page, odt (OpenOffice 1-2).

MOSS Document Converter

  •    

Microsoft Office SharePoint Server (MOSS) Document Converters with Word & Excel 2007 on the server. Converts Office 2003 file types (doc, xls) to PDF and XPS. Could easily be altered to work for docx and xlsx file types. Desktop Automation on the Server: Previously, us...

JODConverter - Automates document conversions using OpenOffice

  •    Java

JODConverter automates conversions between office document formats using OpenOffice.org or LibreOffice. Supported formats include OpenDocument, PDF, RTF, HTML, Word, Excel, PowerPoint, and Flash. It can be used as a Java library, a command line tool, or a web application.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

FoxyPreviewer

  •    

Export your Visual FoxPro reports to Images, RTF, PDF, HTML or XLS super easy! Send them by email! Enhance the look of your previews, and allow your users to decide how their report previews will be.

.NET Data Export Examples

  •    DotNet

This project was created to export data in C# and VB.NET from a database, ListView, or command to PDF, Word, Excel, RTF, HTML, XML, Access, DBF, SQL Script, SYLK, DIF, CSV, and Clipboard.

Apache XML Graphics Commons - Common components for Apache Batik and Apache FOP

  •    Java

Apache XML Graphics Commons is a library that consists of several reusable components used by Apache Batik and Apache FOP. Many of these components can easily be used separately outside the domains of SVG and XSL-FO. You will find components such as a PDF library, an RTF library, Graphics2D implementations that let you generate PDF and PostScript files and much more.

Information Extracter

  •    C++

A utility to extract meta-information (properties/comments) out of various file types, e.g. HTML, PDF, RTF & various Office documents; OGG/MP3 files and JPEG/PNG/GIF images, which can be presented in various output formats (HTML, XML, LaTeX & plain t

yomu - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

  •    Ruby

Yomu is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit. For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.




