parse5 - HTML parsing/serialization toolset for Node

  •        178

HTML parsing/serialization toolset for Node.js, compliant with the WHATWG HTML Living Standard (aka HTML5). parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular2, Polymer and many more.

https://github.com/inikulin/parse5

Related Projects

ineed - Web scraping and HTML-reprocessing. The easy way.

  •    Javascript

Web scraping and HTML-reprocessing. The easy way. ineed doesn't build and traverse a DOM tree; it operates on a sequence of HTML tokens instead. All processing is done in one pass, so it is blazing fast! The token stream is produced by parse5, which parses HTML exactly the same way modern browsers do.
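The one-pass, token-based approach described for ineed can be illustrated with a minimal sketch. This is not ineed's or parse5's API: it assumes Python's standard-library `html.parser` as a stand-in tokenizer (which is not fully spec-compliant), and `LinkCollector`, `collect_links` and `SAMPLE` are hypothetical names chosen for illustration. It only shows how data can be extracted from start-tag tokens as they stream past, without building a tree.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records href values as <a> start-tag tokens arrive."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this hook.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Run a single pass over the markup and return every link target found."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


SAMPLE = (
    '<p>See <a href="https://github.com/inikulin/parse5">parse5</a> '
    'and <A HREF="/docs">docs</A>.</p>'
)
print(collect_links(SAMPLE))
```

No node objects are allocated here; the only state kept is the list of results, which is the main reason a token-level pass can be cheaper than building and walking a DOM.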

html5ever - High-performance browser-grade HTML5 parser

  •    Rust

html5ever is an HTML parser developed as part of the Servo project. It can parse and serialize HTML according to the WHATWG specs (aka "HTML5"). There are some omissions at present, most of which are documented in the bug tracker. html5ever passes all tokenizer tests from html5lib-tests, and most tree builder tests outside of the unimplemented features. The goal is to pass all html5lib tests, and also provide all hooks needed by a production web browser, e.g. document.write.

AngleSharp - The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications

  •    CSharp

AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The included parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code and ensures compatibility with results in evergreen browsers. Also standard DOM features such as querySelector or querySelectorAll work for tree traversal.

SwiftSoup - SwiftSoup: Pure Swift HTML Parser, with best of DOM, CSS, and jquery (Supports Linux, iOS, Mac, tvOS, watchOS)

  •    Swift

SwiftSoup is a pure Swift, cross-platform library (macOS, iOS, tvOS, watchOS and Linux!) for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. SwiftSoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do. After parsing a document and finding some elements, you'll want to get at the data inside those elements.

html5-php - An HTML5 parser and serializer for PHP.

  •    HTML

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over one million downloads. HTML5 provides the following features.


TagSoup - HTML/XML parser for Haskell

  •    Haskell

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

Nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support

  •    Ruby

Nokogiri (鋸) is an HTML, XML, SAX, and DOM parser. Among Nokogiri's many features are the ability to search documents via XPath or CSS3 selectors, an XML/HTML builder, and an XSLT transformer. Nokogiri parses and searches XML/HTML using native libraries (either C or Java, depending on your Ruby), which means it's fast and standards-compliant.

TagSoup - SAX-compliant parser in Java

  •    Java

TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Jodd - The Unbearable Lightness of Java

  •    Java

Jodd is a developer-friendly set of Java microframeworks, tools and utilities, under 1.7 MB. Built with common sense to make things simple, but not simpler. Its features include a slick IoC container, elegant MVC framework, unique AOP engine, thin DB-object mapper, standalone transaction manager, focused validation tool, versatile HTML parsers, pages decorator, super properties, powerful BeanUtil, timeless JDateTime, easy email, many super utilities... and more.

html5lib - Standards-compliant library for parsing and serializing HTML documents and fragments in Python

  •    Python

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers. By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x). Two other tree types are supported: xml.dom.minidom and lxml.etree.

Neko HTML Parser - simple HTML scanner

  •    Java

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and fix up many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.
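The tag-balancing idea can be sketched roughly as follows. This is not NekoHTML's implementation or API (NekoHTML is a Java library); it is a simplified Python illustration built on the standard-library `html.parser`. The `Balancer` class, the `balance` function and the tiny `IMPLIED_END` and `VOID` tables are all assumptions made for the example. The sketch closes elements whose end tags were omitted and repairs mismatched inline tags, but it does not add missing parent elements and it drops attributes for brevity.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny tables; a real balancer covers far more cases.
VOID = {"br", "hr", "img", "input", "meta", "link"}
IMPLIED_END = {  # opening the key implicitly closes an open tag from the value set
    "li": {"li"},
    "p": {"p"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
}


class Balancer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, innermost last
        self.out = []    # pieces of the balanced output

    def handle_starttag(self, tag, attrs):
        # Close the innermost element if this start tag implies its end.
        if self.stack and self.stack[-1] in IMPLIED_END.get(tag, ()):
            self.out.append(f"</{self.stack.pop()}>")
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close everything opened after the matching element, then the element.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()  # flushes any buffered text first
        while self.stack:  # then close whatever is still open
            self.out.append(f"</{self.stack.pop()}>")


def balance(html):
    balancer = Balancer()
    balancer.feed(html)
    balancer.close()
    return "".join(balancer.out)


print(balance("<ul><li>one<li>two</ul>"))
print(balance("<b><i>x</b>y"))
```

The open-element stack is the core of the technique: every repair is expressed as popping elements until the document is consistent again.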

html5-parser - Fast C based HTML 5 parsing for python

  •    C

A fast, standards compliant, C based, HTML 5 parser for python. Over thirty times as fast as pure python based parsers, such as html5lib. See documentation for details.

Hpricot - HTML parser for Ruby

  •    C

Hpricot is a fast, flexible HTML parser. Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out.

HtmlParser

  •    

HtmlParser is a collection of Html processing and querying projects. The first element, the low level parser, is based on and extends Html::Parser from CPAN. This core is an event producing document parser with all other tools and libraries acting as subscribers. Using this...
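The event-producer and subscriber arrangement described above can be sketched in a few lines. This is not the HtmlParser project's code or the CPAN Html::Parser API; it is an illustration in Python using the standard-library `html.parser`, and `EventParser`, `subscribe`, `run` and the two sample subscribers are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class EventParser(HTMLParser):
    """Hypothetical core: turns parser callbacks into (kind, payload) events."""

    def __init__(self):
        super().__init__()
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def _publish(self, kind, payload):
        # Every subscriber sees every event and decides what to act on.
        for callback in self._subscribers:
            callback(kind, payload)

    def handle_starttag(self, tag, attrs):
        self._publish("start", tag)

    def handle_endtag(self, tag):
        self._publish("end", tag)

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs
            self._publish("text", text)


def run(html, *subscribers):
    parser = EventParser()
    for subscriber in subscribers:
        parser.subscribe(subscriber)
    parser.feed(html)
    parser.close()


# Two independent subscribers fed by the same parse.
tag_counts = Counter()
texts = []


def count_tags(kind, payload):
    if kind == "start":
        tag_counts[payload] += 1


def collect_text(kind, payload):
    if kind == "text":
        texts.append(payload)


run("<ul><li>one</li><li>two</li></ul>", count_tags, collect_text)
print(dict(tag_counts), texts)
```

Keeping the parser ignorant of what its subscribers do is the point of the design: new tools can be added without touching the core.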

Fuzi - A fast & lightweight XML & HTML parser in Swift with XPath & CSS support

  •    Swift

Fuzi is based on a Swift port of Mattt Thompson's Ono(斧), using most of its low-level implementations with moderate class and interface redesign following standard Swift conventions, along with several bug fixes. Fuzi(斧子) means "axe", in homage to Ono(斧), which in turn is inspired by Nokogiri (鋸), which means "saw".

markdown - A super fast, highly extensible markdown parser for PHP

  •    HTML

A set of PHP classes, each representing a Markdown flavor, and a command line tool for converting markdown files to HTML files. The implementation focus is to be fast (see benchmark) and extensible. Parsing Markdown to HTML is as simple as calling a single method (see Usage) providing a solid implementation that gives most expected results even in non-trivial edge cases.

Noggit - JSON streaming parser

  •    Java

Noggit is the world's fastest streaming JSON parser for Java. It is used in Apache Solr.

html-agility-pack - Html Agility Pack (HAP)

  •    CSharp

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams). Have a question? Ask questions and find answers from over 2500 questions.

HTMLReader - A WHATWG-compliant HTML parser in Objective-C.

  •    HTML

A WHATWG-compliant HTML parser with CSS selectors in Objective-C and Foundation. It parses HTML just like a browser. Copy the files in the Sources folder into your project.

DiDOM - Simple and fast HTML parser

  •    PHP

DiDOM - simple and fast HTML parser. The second parameter specifies whether a file needs to be loaded. Default is false.