html5lib - Standards-compliant library for parsing and serializing HTML documents and fragments in Python


html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x). Two other tree types are supported: xml.dom.minidom and lxml.etree.
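As a rough illustration of the tree types mentioned above, the sketch below parses one malformed fragment twice: once with the default xml.etree builder and once with the xml.dom.minidom builder. It assumes html5lib is installed (for example via pip); the sample markup and the helper names `parse_default` and `parse_minidom` are made up for this example.

```python
import xml.dom.minidom

import html5lib

# Hypothetical sample input: unclosed <p> and <b>, no <html> or <body>.
HTML = "<p>Hello <b>world"


def parse_default(markup):
    """Parse with the default treebuilder, which yields an xml.etree element.

    namespaceHTMLElements=False keeps tag names plain ("p", not "{ns}p").
    """
    return html5lib.parse(markup, namespaceHTMLElements=False)


def parse_minidom(markup):
    """Parse with the "dom" treebuilder, which yields an xml.dom.minidom document."""
    return html5lib.parse(markup, treebuilder="dom")


root = parse_default(HTML)
bold = root.find("body/p/b")  # the parser supplied <html>, <head> and <body>
print(root.tag, bold.text)

dom = parse_minidom(HTML)
print(type(dom).__name__, len(dom.getElementsByTagName("b")))
```

Passing `treebuilder="lxml"` selects the lxml.etree tree type in the same way, provided lxml is also installed.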

https://github.com/html5lib/html5lib-python


   




Related Projects

html5ever - High-performance browser-grade HTML5 parser

  •    Rust

html5ever is an HTML parser developed as part of the Servo project. It can parse and serialize HTML according to the WHATWG specs (aka "HTML5"). There are some omissions at present, most of which are documented in the bug tracker. html5ever passes all tokenizer tests from html5lib-tests, and most tree builder tests outside of the unimplemented features. The goal is to pass all html5lib tests, and also provide all hooks needed by a production web browser, e.g. document.write.

html5-parser - Fast C based HTML 5 parsing for python

  •    C

A fast, standards-compliant, C-based HTML 5 parser for Python. Over thirty times as fast as pure-Python parsers such as html5lib. See documentation for details.

html5-php - An HTML5 parser and serializer for PHP.

  •    HTML

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over one million downloads.

TagSoup - HTML/XML parser for Haskell

  •    Haskell

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

Hpricot - HTML parser for Ruby

  •    C

Hpricot is a fast, flexible HTML parser. Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out.


html-agility-pack - Html Agility Pack (HAP)

  •    CSharp

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).

myhtml - Fast C/C++ HTML 5 Parser. Using threads.

  •    C

MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies. Please use the HTML parser from the Lexbor project. It is stable, has more features, and — yes — it's very fast.

AngleSharp - The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications

  •    CSharp

AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The included parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code and ensures compatibility with results in evergreen browsers. Also standard DOM features such as querySelector or querySelectorAll work for tree traversal.

Modest - Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies

  •    C

Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies. Please use lexbor instead. It is stable, has more features, and, yes, it's very fast.

goquery - A little like that j-thing, only in Go.

  •    Go

goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off. Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this.

jspoon - Annotation based HTML to Java parser + Retrofit converter

  •    Java

jspoon is a Java library that parses HTML into Java objects based on CSS selectors. It uses jsoup underneath as an HTML parser. It looks for the first occurrence in HTML and sets its value to a field.

posthtml - PostHTML is a tool to transform HTML/XML with JS plugins

  •    Javascript

PostHTML is a tool for transforming HTML/XML with JS plugins. PostHTML itself is very small. It includes only an HTML parser, an HTML node tree API and a node tree stringifier. All HTML transformations are made by plugins. These plugins are just small plain JS functions, which receive an HTML node tree, transform it, and return a modified tree.

parse5 - HTML parsing / serialization toolset for Node.js

  •    Javascript

parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular2, Polymer and many more.

Neko HTML Parser - simple HTML scanner

  •    Java

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and fix up many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements, automatically closes elements with optional end tags, and can handle mismatched inline element tags.

Jodd - The Unbearable Lightness of Java

  •    Java

Jodd is a developer-friendly set of Java microframeworks, tools and utilities, under 1.7 MB. Built with common sense to make things simple, but not simpler. Its features include a slick IoC container, elegant MVC framework, unique AOP engine, thin DB-object mapper, standalone transaction manager, focused validation tool, versatile HTML parsers, pages decorator, super properties, powerful BeanUtil, timeless JDateTime, easy email, many super utilities... and more.

Beautiful Soup - Python HTML/XML parser

  •    Python

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
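The four queries quoted above might look roughly like this in code. This is a minimal sketch that assumes the current bs4 package (Beautiful Soup 4) is installed and uses Python's built-in "html.parser" backend; the sample markup and variable names are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page, made up for this example.
HTML = """
<table><tr><th>Plain</th><th><b>Bold heading</b></th></tr></table>
<a href="http://foo.com/a" class="externalLink">one</a>
<a href="/local">two</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external_links = soup.find_all("a", class_="externalLink")

# "Find all the links whose urls match 'foo.com'"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_heading = next(th.b.get_text() for th in soup.find_all("th") if th.b)

print(len(all_links), len(external_links), foo_links[0]["href"], bold_heading)
```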

HTML Purifier - Standards compliant HTML filter written in PHP

  •    PHP

HTML Purifier is an HTML filtering solution that uses a unique combination of robust whitelists and aggressive parsing to ensure that not only are XSS attacks thwarted, but the resulting HTML is standards compliant.

Html Agility Pack

  •    

This is an HTML parser that builds a read/write DOM from “real world” HTML files. It supports XPATH or XSLT and is tolerant of "real world" malformed HTML.

php-simple-html-dom-parser - PHP Simple HTML DOM Parser adaptation for Composer and PSR-0

  •    HTML

An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way. Requires PHP 5+. Supports invalid HTML. Find tags on an HTML page with selectors just like jQuery. Extract contents from HTML in a single line.

HtmlCleaner - HTML parser in Java

  •    Java

HtmlCleaner is an HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those most web browsers use to create the Document Object Model. However, users may provide a custom tag and rule set for tag filtering and balancing.





