mammoth.js - Convert Word documents (.docx files) to HTML

  •        375

Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to an h1 element, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading. There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
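The style-driven approach described above can be illustrated with a small, self-contained sketch. It is not Mammoth's implementation or API: the paragraph objects, the STYLE_TO_TAG table, and the paragraphsToHtml function are hypothetical names made up for this example. It only shows the idea of choosing an HTML element from a paragraph's style name while ignoring visual formatting.

```javascript
// Illustrative sketch only: NOT Mammoth's implementation or API.
// The paragraph shape ({ styleName, text, ... }) and STYLE_TO_TAG are
// hypothetical, chosen to mirror the "Heading 1 becomes h1" example.

// Map semantic style names to HTML element names.
const STYLE_TO_TAG = new Map([
  ["Heading 1", "h1"],
  ["Heading 2", "h2"],
  ["Normal", "p"],
]);

// Escape the characters that are significant in HTML text content.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Convert paragraphs to HTML using only the style name.
// Visual details such as font, size, and colour are deliberately ignored.
function paragraphsToHtml(paragraphs) {
  return paragraphs
    .map(({ styleName, text }) => {
      // Unknown styles fall back to a plain paragraph.
      const tag = STYLE_TO_TAG.get(styleName) ?? "p";
      return `<${tag}>${escapeHtml(text)}</${tag}>`;
    })
    .join("");
}

const html = paragraphsToHtml([
  { styleName: "Heading 1", text: "Quarterly report", font: "Arial", sizePt: 28 },
  { styleName: "Normal", text: "Sales & costs rose.", colour: "#333333" },
]);

console.log(html);
// prints: <h1>Quarterly report</h1><p>Sales &amp; costs rose.</p>
```

Because the output depends only on style names, two documents that look very different but use the same styles produce the same HTML, which is the trade-off the description points to.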

https://github.com/mwilliamson/mammoth.js

Dependencies:

bluebird : ~3.4.0
sax : ~1.1.1
underscore : ~1.8.3
lop : ~0.4.0
argparse : ~1.0.3
jszip : ~2.5.0
xmlbuilder : ~2.6.4
path-is-absolute : ^1.0.0

Related Projects

Pandoc - Universal markup converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.

docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files

  •    Java

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

ONLYOFFICE Desktop Editors - An office suite that combines text, spreadsheet and presentation editors for creating, viewing and editing local documents

  •    C

ONLYOFFICE Desktop Editors is a free and open source office suite comprising text document, spreadsheet and presentation editors. It lets you create, view and edit documents of any size and complexity, and switch easily to online mode for real-time co-editing and collaboration. Features such as reviewing, commenting and chat are available as well. The tab-based user interface lets you work with multiple files in a single window.

DocX - Fast and easy to use

  •    CSharp

DocX is a .NET library that allows developers to manipulate Word 2007/2010/2013 files in an easy and intuitive manner. DocX is fast, lightweight and, best of all, it does not require Microsoft Word or Office to be installed. NOTE: There is a new Master branch as of Oct. 3, 2017. Please read about the Classic branch if you were using this project before the change.

Pandoc - General Markup Converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to HTML formats, word processor formats, PDF and other markup formats.


academicmarkdown - Academic writing with Markdown

  •    Python

Academic Markdown is a Python module for generating .md, .html, .pdf, .docx, and .odt files from Markdown source. Pandoc is used for most of the heavy lifting, so refer to the Pandoc website for detailed information about writing in Pandoc Markdown. However, Academic Markdown offers some additional functionality that is useful for writing scientific documents, such as integration with Zotero references, and a number of useful Academic Markdown extensions. At present, the main target for Academic Markdown is the OpenSesame documentation site, http://osdoc.cogsci.nl/, although it may in time grow into a more comprehensive and user-friendly tool.

PHPWord - A pure PHP library for reading and writing word processing documents

  •    PHP

PHPWord is a library written in pure PHP that provides a set of classes to write to and read from different document file formats. The current version of PHPWord supports Microsoft Office Open XML (OOXML or OpenXML), OASIS Open Document Format for Office Applications (OpenDocument or ODF), Rich Text Format (RTF), HTML, and PDF. PHPWord is an open source project licensed under the terms of LGPL version 3. PHPWord aims to be a high-quality software product by incorporating continuous integration and unit testing. You can learn more about PHPWord by reading the Developers' Documentation.

DocX

  •    DotNet

DocX is a .NET library written in C# which allows a developer to manipulate Word 2007 files in an easy and intuitive way.

Simple OOXML

  •    

Simple OOXML makes the creation of Office Open XML documents easier for developers. Modify or create any .docx or .xlsx document without Microsoft Word or Microsoft Excel. Uses the Open XML SDK v 2.0.

Joeffice - Office Written in Java

  •    Java

Joeffice is the first open source office suite written in Java. Its features include a docking system that lets you view several documents in the same window, keep many documents open at the same time, and easily switch from one to another. It works with Microsoft document formats (docx, xlsx, pptx). It can get data through web services (RMI, SOAP, REST).

docconv - Converts PDF, DOC, DOCX, XML, HTML, RTF, etc to plain text

  •    Go

A Go wrapper library to convert PDF, DOC, DOCX, XML, HTML, RTF, ODT, Pages documents and images (see optional dependencies below) to plain text. Note for returning users: the Go code path for this package has been moved to code.sajari.com/docconv. Follow the installation instructions to check out a version of the code in the correct place.

gotenberg - A stateless API for converting Markdown files, HTML files and Office documents to PDF

  •    Go

At TheCodingMachine, we build a lot of web applications (intranets, extranets and so on) that need to generate PDFs from various sources. Each time, we ended up using well-known libraries like wkhtmltopdf or unoconv and lost time reimplementing a solution from one project to the next. Meh. The API is now available on your host under http://127.0.0.1:3000.

word2markdown - Convert Word to Markdown, with images and math

  •    XSLT

For Word-to-Markdown scripts, first navigate to this directory, using cd doc-to-md. Run './accept.sh' to generate new markdown, which you can compare to the original markdown using git.

phishery - An SSL Enabled Basic Auth Credential Harvester with a Word Document Template URL Injector

  •    Go

Phishery is a simple SSL-enabled HTTP server with the primary purpose of phishing credentials via Basic Authentication. Phishery also provides the ability to easily inject the URL into a .docx Word document. The power of phishery is best demonstrated by setting a Word document's template to a phishery URL. This causes Microsoft Word to make a request to the URL, resulting in an authentication dialog being shown to the end user. Injecting a URL into any .docx file is possible using phishery's -i [in docx], -o [out docx], and -u [url] options.

Word To SharePoint (Transform Word Documents to MOSS / WSS)

  •    

A SharePoint Feature for easy conversion of Word 2007 documents to SharePoint/MOSS. The solution also extracts, transfers and re-links images to a selected ImageLibrary, includes styles, tables, etc.

HackMyResume - Generate polished résumés and CVs in HTML, Markdown, LaTeX, MS Word, PDF, plain text, JSON, XML, YAML, smoke signal, and carrier pigeon

  •    Javascript

Create polished résumés and CVs in multiple formats from your command line or shell. Author in clean Markdown and JSON, export to Word, HTML, PDF, LaTeX, plain text, and other arbitrary formats. Fight the power, save trees. Compatible with FRESH and JRS resumes. HackMyResume is built with Node.js and runs on recent versions of OS X, Linux, or Windows. View the FAQ.

DOTX to DOCX Converter

  •    CSharp

DOTX to DOCX Converter converts Office Open XML templates (DOTX/DOTM) to Office Open XML documents (DOCX/DOCM). The program is an effective supplement to the Microsoft Office Compatibility Pack, which cannot convert these files.

MOSS Document Converter

  •    

Microsoft Office SharePoint Server (MOSS) Document Converters with Word & Excel 2007 on the server. Converts Office 2003 file types (doc, xls) to pdf and xps. Could easily be altered to work for docx and xlsx file types. Desktop Automation on the Server: Previously, us...

unioffice - Pure go library for creating and processing Office Word (

  •    Go

Announcement (2019/04/29): UniDoc acquires gooxml. UniDoc (https://unidoc.io and https://github.com/unidoc) has acquired gooxml from Baliance and we plan to add it to our suite of document format support for Go. The repository (gooxml) will be moving to a new home: https://github.com/unidoc/unioffice and the package name will become unioffice.

HTML to docx Converter

  •    

This converts HTML into Word documents (docx format). The code is written in PHP and works with PHPWord.