textract - node

  •        230

A text extraction module for Node.js. In almost all cases, what textract cares about is the mime type, not the file extension. So .html and .htm, which share a mime type, are both extracted. Other extensions that share a mime type with a supported format should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for five other file types.
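The mime-type behaviour described above can be illustrated with a minimal sketch. This is not textract's actual code: the function name `pickExtractor`, the two lookup tables, and the extractor labels are all hypothetical, and the `.xlt` mapping is an assumption used only to show a second extension sharing the Excel mime type.

```javascript
// Hypothetical sketch: choose an extractor by mime type, not by extension.
// Neither table reflects textract's real internals.
const path = require('path');

// Several extensions can map to one mime type (assumed mappings).
const MIME_BY_EXTENSION = new Map([
  ['.html', 'text/html'],
  ['.htm', 'text/html'],
  ['.xls', 'application/vnd.ms-excel'],
  ['.xlt', 'application/vnd.ms-excel'],
]);

// One extractor is registered per mime type (labels are illustrative).
const EXTRACTOR_BY_MIME = new Map([
  ['text/html', 'html'],
  ['application/vnd.ms-excel', 'xls'],
]);

// Returns the extractor label for a file path, or null if none applies.
function pickExtractor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  const mimeType = MIME_BY_EXTENSION.get(ext);
  if (mimeType === undefined) {
    return null;
  }
  return EXTRACTOR_BY_MIME.get(mimeType) || null;
}

console.log(pickExtractor('index.htm'));
console.log(pickExtractor('budget.xlt'));
```

Because the lookup goes through the mime type, supporting a new extension in this sketch only needs one extra row in the first table; the extractor table stays unchanged.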

https://github.com/dbashford/textract

Dependencies:

mime : 2.2.0
pdf-text-extract : 1.3.1
xpath : 0.0.23
xmldom : 0.1.27
j : 0.4.3
cheerio : 0.22.0
marked : 0.3.17
meow : 3.7.0
got : 5.7.1
html-entities : 1.2.0
iconv-lite : 0.4.15
jschardet : 1.4.1
yauzl : 2.7.0

Related Projects

ONLYOFFICE Desktop Editors - An office suite that combines text, spreadsheet and presentation editors for creating, viewing and editing local documents

  •    C

ONLYOFFICE Desktop Editors is a free and open source office suite comprising text document, spreadsheet and presentation editors. It lets you create, view and edit documents of any size and complexity, and switch easily to online mode for real-time co-editing and collaboration. Features such as reviewing, commenting and chat are available as well. The tab-based user interface lets you work with multiple files in a single window.

react-native-doc-viewer - React Native Doc Viewer (Supports file formats: xls, ppt, doc, xlsx, pptx, csv, docx, png, jpg, pdf, xml, binary)

  •    Objective-C

A React Native native module that bridges the Quick Look document viewer for iOS and Android. It supports pdf, png, jpg, xls, ppt, doc, docx, pptx and xlsx files, and mp4 video playback is also supported.

docconv - Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

  •    Go

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (with optional dependencies) to plain text. Note for returning users: the Go code path for this package has been moved to code.sajari.com/docconv. Follow the installation instructions to check out a version of the code in the correct place.

Pandoc - General Markup Converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to HTML formats, word processor formats, PDF and other markup formats.


docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files

  •    Java

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

Pandoc - Universal markup converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.

free-file-icons - Platform-agnostic icons for audio, image, programming and office files.

  •    

A free icon set with vector images for popular extensions: AAC, AI, AIFF, AVI, C, CPP, CSS, CSV, DAT, DMG, DOC, EXE, FLV, GIF, H, HPP, HTML, ICS, JAVA, JPG, KEY, MID, MP3, MP4, MPG, PDF, PHP, PNG, PPT, PSD, PY, QT, RAR, RB, RTF, SQL, TIFF, TXT, WAV, XLS, XML, YML, ZIP. All icons are also offered in 512x512px, 48x48px, 32x32px.

Mollify - Web File Manager

  •    PHP

Mollify is a web file manager for publishing and managing files hosted on a web server. Users can be given access to different files with different permissions. It supports file search, zip archive extraction, file uploading (large files are uploaded in small chunks), WebDAV and a lot more.

PHPWord - A pure PHP library for reading and writing word processing documents

  •    PHP

PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF. PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord aims to be a high quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers' Documentation.

pyexcel - Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files

  •    Python

If your company has embedded pyexcel and its components into a revenue-generating product, please support me on Patreon or Bountysource to maintain the project and develop it further. If you are an individual, you are welcome to support me too, for however long you feel like. As my backer, you will receive early access to pyexcel-related content.

textract - extract text from any document. no muss. no fuss.

  •    HTML

Extract text from any document. No muss. No fuss. Full documentation.

pandoc-ruby - Ruby wrapper for Pandoc

  •    Ruby

PandocRuby is a wrapper for Pandoc, a Haskell library with command line tools for converting one markup format to another. Pandoc can convert documents from a variety of formats including markdown, reStructuredText, textile, HTML, DocBook, LaTeX, and MediaWiki markup to a variety of other formats, including markdown, reStructuredText, HTML, LaTeX, ConTeXt, PDF, RTF, DocBook XML, OpenDocument XML, ODT, GNU Texinfo, MediaWiki markup, groff man pages, HTML slide shows, EPUB, Microsoft Word docx, and more.

canvas - Cairo in Go: vector to raster, SVG, PDF, EPS, WASM, OpenGL, Gio, etc.

  •    Go

Canvas is a common vector drawing target that can output SVG, PDF, EPS, raster images (PNG, JPG, GIF, ...), HTML Canvas through WASM, OpenGL, and Gio. It implements a wide range of path manipulation functionality such as flattening, stroking and dashing. Additionally, it has a text formatter and embeds and subsets fonts (TTF, OTF, WOFF, WOFF2, or EOT) or converts them to outlines. It can be considered a Cairo or node-canvas alternative in Go. See the example in Figure 1 for an overview of the functionality. Figure 1: top-left you can see text being fitted into a box, justified using Donald Knuth's line breaking algorithm to stretch the spaces between words to fill the whole width. You can observe a variety of styles and text decorations applied, as well as support for LTR/RTL mixing and complex scripts. In the bottom-right the word "stroke" is being stroked and drawn as a path. Top-right we see a LaTeX formula that has been converted to a path. Left of that we see an ellipse showcasing precise dashing; notably, the length of e.g. the short dash is equal wherever it is on the curve. Note that the dashes themselves are elliptical arcs as well (thus exactly precise even if magnified greatly). To the right we see a closed polygon of four points being smoothed by cubic Béziers that are smooth along the whole path, and the blue line on the left shows a smoothed open path. On the bottom you can see a rotated rasterized image. The result is equivalent for all renderers (PNG, PDF, SVG, etc.).

Information Extracter

  •    C++

A utility to extract meta-information (properties/comments) out of various file types, e.g. HTML, PDF, RTF and various Office documents; OGG/MP3 files and JPEG/PNG/GIF images, which can be presented in various output formats (HTML, XML, LaTeX and plain t

yomu - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

  •    Ruby

Yomu is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit. For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

Virtual Image Printer driver

  •    C

Virtual ImagePrinter is based on the Microsoft universal printer driver. ImagePrinter can print any printable document on your Windows system to one or many BMP, PNG, JPG, TIFF or PDF files. It converts Word to PDF, Word to JPG, and DOC, DOCX, PDF, TXT, HTM and RTF files to image formats. Please visit http://code-industry.net for more information.

Xena - Digital Preservation Software

  •    Java

Xena transforms files into open data formats for long-term digital preservation, encodes content in Base64 and wraps it in XML metadata. Formats supported include MBOX, PST, MSG, DOC, XLS, PPT, RTF, PNG, XML, PDF, JPG, TIFF, PCX, WAV, MP3 and more.

Marker - Markdown editor for Linux made with GTK+-3.0

  •    C++

Marker is a markdown editor for Linux made with GTK+-3.0. It provides support to view and edit markdown documents. It supports TeX math rendering with KaTeX or MathJax. It also supports Mermaid diagrams, Charter for plotting, syntax highlighting for code blocks with highlight.js, an integrated sketch editor, and flexible export options to PDF, RTF, ODT and DOCX.

LICOM - Linux compressor/decompressor

  •    C

LICOM is a compression/de-compression tool that sits in-between the browser and content from the server. It aims to facilitate compression of ASCII text, RTF, DOC, PDF, HTML, GIF, JPG, BMP and similar filetypes.





