Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in Markdown, reStructuredText, Textile, HTML, DocBook, or LaTeX to HTML formats, word processor formats, PDF, and other markup formats.
It can read
[markdown] and (subsets of) [Textile], [reStructuredText], [HTML],
[LaTeX], [MediaWiki markup], and [DocBook XML]; and it can write plain
text, [markdown], [reStructuredText], [XHTML], [HTML 5], [LaTeX]
(including [beamer] slide shows), [ConTeXt], [RTF], [DocBook XML],
[OpenDocument XML], [ODT], [Word docx], [GNU Texinfo], [MediaWiki
markup], [EPUB] (v2 or v3), [FictionBook2], [Textile], [groff man] pages, [Emacs
Org-Mode], [AsciiDoc], and [Slidy], [Slideous], [DZSlides], or [S5] HTML
slide shows. It can also produce [PDF] output on systems where LaTeX is
installed.
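As a rough illustration of the command-line tool described above, the following Python sketch builds and runs a pandoc invocation. The helper names `build_pandoc_command` and `convert` and the file names are hypothetical and chosen only for this example; the sketch assumes a `pandoc` executable is available on the PATH and uses only pandoc's `-f`, `-t`, and `-o` options.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pandoc_command(source: str, target: str,
                         from_format: Optional[str] = None,
                         to_format: Optional[str] = None) -> List[str]:
    """Build the argument list for a pandoc call (hypothetical helper).

    Nothing is executed here. When no formats are given, pandoc infers
    them from the file extensions.
    """
    command = ["pandoc", source, "-o", target]
    if from_format:
        command += ["-f", from_format]  # input format, e.g. "markdown"
    if to_format:
        command += ["-t", to_format]    # output format, e.g. "docx"
    return command


def convert(source: str, target: str,
            from_format: Optional[str] = None,
            to_format: Optional[str] = None) -> None:
    """Run pandoc on `source` and write the result to `target`."""
    # Assumption: pandoc is installed separately and reachable on the PATH.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    command = build_pandoc_command(source, target, from_format, to_format)
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(command, check=True)
```

Calling `convert("notes.md", "notes.html")` would, under these assumptions, turn a markdown file into HTML. A `.pdf` target additionally needs a LaTeX installation, as noted above.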
Tags | text-extraction document-conversion document markup text-to-pdf |
Implementation | Haskell |
License | GPLv2 |
Platform | Windows Linux |
Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.
document-conversion markup markup-converter text-extraction
Ghostscript is a rendering and conversion engine for page description languages, including PostScript and PDF. It can convert PostScript language files to many raster formats, display them on screen, and print them on printers that lack built-in PostScript support.
document-conversion pdf-text-extraction text-extraction graphics pdf postscript printing
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
text-extraction document-extraction office-documents open-xml export conversion
documents4j is a Java library for converting documents into another document format. This is achieved by delegating the conversion to any native application which understands the conversion of the given file into the desired target format.
document-processing document-conversion text-extraction microsoft-documents
docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.
document-processing document-conversion text-extraction microsoft-documents
jPod is a PDF manipulation and rendering framework. It can read documents and verify them against the PDF specification, and it provides a content stream and rendering framework. It can also create new documents and perform incremental updates.
pdf text-extraction pdf-library pdf-library-java
TCPDF is a PHP class for generating PDF documents without requiring external extensions. TCPDF supports UTF-8, Unicode, RTL languages, XHTML, JavaScript, digital signatures, barcodes, and much more.
text-extraction pdf pdf-library pdf-text-extraction
Apache PDFBox is an open source Java PDF library for working with PDF documents. This library allows creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. It supports bookmarks, fonts, text extraction, encryption, PDF printing, and more.
pdf text-extraction pdf-library pdf-library-dotnet pdf-library-java
JODConverter automates conversions between office document formats using OpenOffice.org or LibreOffice. Supported formats include OpenDocument, PDF, RTF, HTML, Word, Excel, PowerPoint, and Flash. It can be used as a Java library, a command line tool, or a web application.
document-conversion text-extraction document office
borb is a pure Python library to read, write, and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries, and primitives (numbers, strings, booleans, etc.).
pdf library sdk typesetting pdf-converter python3 pdf-conversion pdf-generation pdf-library text-extraction
and more, in a single portable script. nb creates notes in text-based formats like Markdown, Org, and LaTeX, can work with files in any format, can import and export notes to many document formats, and can create private, password-protected encrypted notes and bookmarks. With nb, you can write notes using Vim, Emacs, VS Code, Sublime Text, and any other text editor you like, as well as terminal and GUI web browsers. nb works in any standard Linux / Unix environment, including macOS and Windows via WSL. Optional dependencies can be installed to enhance functionality, but nb works great without them.
git vim shell bash markdown cli productivity sync command-line notebook notes archiving vscode pandoc bookmarks note-taking knowledge-base bookmark-manager notes-app zettelkasten terminal prompt emacs versioning syncing encryption bookmarking tagging tags archive
iText is a popular and widely used PDF library for generating PDF documents dynamically. It is mostly used by web developers to generate PDF documents and reports from data in an XML file or a database and serve them to the browser. It supports adding bookmarks, watermarks, encryption, form filling, and more.
pdf text-extraction pdf-library pdf-library-dotnet pdf-library-java
A library for PDF manipulation implementing Adobe PDF standard version 1.7. This library can read PDF files and apply changes to them. It is written in .NET 2.0 using Visual Studio 2005. Both writing and parsing PDF are supported.
text-extraction pdf pdf-library pdf-library-dotnet
APIs for manipulating various file formats based upon Office Open XML (ECMA-376) and Microsoft's OLE 2 Compound Document formats using pure Java. Apache POI is your Java Excel, Word and PowerPoint solution. We have a complete API for porting other OOXML and OLE 2 Compound Document formats and welcome others to participate.
office text-extraction office-api document-processing document-conversion microsoft-documents
The Microsoft Document Translator translates Microsoft Office, plain text, HTML, PDF, and SRT caption files from and to any of the 60+ languages supported by the Microsoft Translator web service. Document Translator uses the customer's own credentials and subscription to perform the translation. It may also use custom MT systems trained via Custom Translator (https://portal.customtranslator.azure.ai). Document Translator uses Version 3 of the Translator API. It can translate one or more Office, plain text, HTML, or PDF documents to another language in one go.
AsciiDoc is a text document format for writing notes, documentation, articles, books, ebooks, slideshows, web pages, man pages, and blogs. AsciiDoc files can be translated to many formats, including HTML, PDF, EPUB, and man pages. AsciiDoc is highly configurable: both the AsciiDoc source file syntax and the backend output markups (which can be almost any type of SGML/XML markup) can be customized and extended by the user.
SWFTools is a collection of utilities for working with Adobe Flash files (SWF files). The tool collection includes programs for reading SWF files, combining them, and creating them from other content (such as images, sound files, videos, or source code).
document-conversion swf-converter swf adobe-flash flash
OpenPipe is an open source scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps / operations performed on a document to convert from its raw form to something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction or submitting the document to a search engine.
content-connector text-analysis nlp document-pipeline text-processing
Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
search-engine searchengine full-text-search facet distributed analytics