textract - node

A text extraction node module. In almost all cases, what textract cares about is the mime type rather than the file extension. So .html and .htm, which share the same mime type, will both be extracted. Other extensions that share a mime type with a supported extension should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
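The mime-type idea above can be sketched in a few lines. This is only an illustration in Python, not textract's own code or API: the `EXT_TO_MIME` table, the `EXTRACTORS` registry, and the `extract` function are hypothetical names, and the table is a tiny hand-written subset. The point is that the extractor is chosen by mime type, so two extensions that map to the same type reach the same extractor.

```python
import os
from html.parser import HTMLParser

# Hypothetical, hand-written subset of an extension -> mime type table.
EXT_TO_MIME = {
    ".html": "text/html",
    ".htm": "text/html",
    ".txt": "text/plain",
    ".xls": "application/vnd.ms-excel",
}


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_html(raw: bytes) -> str:
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8"))
    collector.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(collector.parts).split())


def extract_plain(raw: bytes) -> str:
    return raw.decode("utf-8")


# Extractors are registered per mime type, not per extension.
EXTRACTORS = {
    "text/html": extract_html,
    "text/plain": extract_plain,
}


def extract(filename: str, raw: bytes) -> str:
    """Pick an extractor from the file's mime type and run it."""
    ext = os.path.splitext(filename)[1].lower()
    mime = EXT_TO_MIME.get(ext)
    if mime is None:
        raise ValueError(f"unknown extension: {ext!r}")
    if mime not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {mime}")
    return EXTRACTORS[mime](raw)
```

With this layout, `extract("page.html", data)` and `extract("page.htm", data)` go through the same code path, and supporting another extension with an already handled mime type only needs a new row in the table.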

https://github.com/dbashford/textract

Dependencies:

mime : 2.2.0
pdf-text-extract : 1.3.1
xpath : 0.0.23
xmldom : 0.1.27
j : 0.4.3
cheerio : 0.22.0
marked : 0.3.17
meow : 3.7.0
got : 5.7.1
html-entities : 1.2.0
iconv-lite : 0.4.15
jschardet : 1.4.1
yauzl : 2.7.0

Related Projects

ONLYOFFICE Desktop Editors - An office suite that combines text, spreadsheet and presentation editors for creating, viewing and editing local documents

  •    C

ONLYOFFICE Desktop Editors is a free and open source office suite comprising text document, spreadsheet and presentation editors. It lets you create, view and edit documents of any size and complexity, and easily switch to online mode for real-time co-editing and collaboration. Features such as reviewing, commenting and chat are available as well. The tab-based user interface lets you work with multiple files in a single window.

react-native-doc-viewer - React Native Doc Viewer (Supports file formats: xls, ppt, doc, xlsx, pptx, csv, docx, png, jpg, pdf, xml, binary)

  •    Objective-C

A React Native native module bridge: a Quick Look document viewer for iOS and Android. It supports pdf, png, jpg, xls, ppt, doc, docx, pptx and xlsx, plus an mp4 video player.

docconv - Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

  •    Go

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text. Note for returning users: the Go code path for this package has been moved to code.sajari.com/docconv. Follow the installation instructions to check out a version of the code in the correct place.

Pandoc - General Markup Converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in Markdown, reStructuredText, Textile, HTML, DocBook, or LaTeX to HTML formats, word processor formats, PDF and other markup formats.


docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files

  •    Java

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

Pandoc - Universal markup converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.

free-file-icons - Platform-agnostic icons for audio, image, programming and office files.

  •    

A free icon set with vector images for popular extensions: AAC, AI, AIFF, AVI, C, CPP, CSS, CSV, DAT, DMG, DOC, EXE, FLV, GIF, H, HPP, HTML, ICS, JAVA, JPG, KEY, MID, MP3, MP4, MPG, PDF, PHP, PNG, PPT, PSD, PY, QT, RAR, RB, RTF, SQL, TIFF, TXT, WAV, XLS, XML, YML, ZIP. All icons are also offered in 512x512px, 48x48px, 32x32px.

Mollify - Web File Manager

  •    PHP

Mollify is a web file manager for publishing and managing files hosted on a web server. Users can have access to different files with different permissions. It supports file search, zip archive extraction, file uploading (large files are uploaded in small chunks), WebDAV and a lot more.

PHPWord - A pure PHP library for reading and writing word processing documents

  •    PHP

PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF. PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord aims to be a high-quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers' Documentation.

pyexcel - Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files

  •    Python

If your company has embedded pyexcel and its components into a revenue generating product, please support me on patreon or bounty source to maintain the project and develop it further. If you are an individual, you are welcome to support me too and for however long you feel like. As my backer, you will receive early access to pyexcel related contents.

textract - extract text from any document. no muss. no fuss.

  •    HTML

Extract text from any document. No muss. No fuss. Full documentation.

Information Extracter

  •    C++

A utility to extract meta-information (properties/comments) out of various file-types; e.g. HTML, PDF, RTF & various Office documents; OGG/MP3 files and JPEG/PNG/GIF images, which can be presented in various output formats (HTML, XML, LaTeX & plain t

yomu - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

  •    Ruby

Yomu is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit. For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

Virtual Image Printer driver

  •    C

Virtual ImagePrinter is based on the Microsoft universal printer driver. ImagePrinter can print any printable document on your Windows system to one or more BMP, PNG, JPG, TIFF or PDF files. Convert Word to PDF, Word to JPG, and convert DOC, DOCX, PDF, TXT, HTM and RTF files to image formats. Please visit http://code-industry.net for more information.

Xena - Digital Preservation Software

  •    Java

Xena transforms files into open data formats for long-term digital preservation, encodes content in Base64 and wraps it in XML metadata. Formats supported include MBOX, PST, MSG, DOC, XLS, PPT, RTF, PNG, XML, PDF, JPG, TIFF, PCX, WAV, MP3 and more.

LICOM - Linux compressor/decompressor

  •    C

LICOM is a compression/de-compression tool that sits in-between the browser and content from the server. It aims to facilitate compression of ASCII text, RTF, DOC, PDF, HTML, GIF, JPG, BMP and similar filetypes.

PowerMeta - PowerMeta searches for publicly available files hosted on various websites for a particular domain by using specially crafted Google, and Bing searches

  •    PowerShell

PowerMeta searches for publicly available files hosted on various websites for a particular domain by using specially crafted Google, and Bing searches. It then allows for the download of those files from the target domain. After retrieving the files, the metadata associated with them can be analyzed by PowerMeta. Some interesting things commonly found in metadata are usernames, domains, software titles, and computer names. For many organizations it's common to find publicly available files posted on their external websites. Many times these files contain sensitive information that might be of benefit to an attacker like usernames, domains, software titles or computer names. PowerMeta searches both Bing and Google for files on a particular domain using search strings like "site:targetdomain.com filetype:pdf". By default it searches for "pdf, docx, xlsx, doc, xls, pptx, and ppt".

Arachnode.net

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

Yioop - Open Source Search Engine Software

  •    PHP

Yioop is an open source PHP search engine capable of crawling, indexing, and providing search results for hundreds of millions of pages on relatively low-end hardware. It can index a variety of text formats (HTML, RSS, PDF, RTF, DOC) and images (GIF, JPEG, PNG, etc.). It can import data from ARC, WARC, MediaWiki and Open Directory RDF. It is easily localized to many languages. It has built-in support for news feeds, discussion groups, blogs, and wikis. It also supports mixing indexes to create mashups.