TagSoup - SAX-compliant parser in Java

  •    Java

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

http://home.ccil.org/~cowan/XML/tagsoup/
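
Because TagSoup is SAX-compliant, application code is written against the standard org.xml.sax interfaces rather than a TagSoup-specific API. The following is a minimal sketch of such a handler, not code from the TagSoup distribution: the class name LinkCollector and the method collectLinks are hypothetical, and the sketch uses the JDK's built-in SAX parser on well-formed input so that it runs without extra libraries. With TagSoup on the classpath, the XMLReader would presumably be an instance of org.ccil.cowan.tagsoup.Parser instead; that substitution is an assumption and is not exercised here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the href of every <a> element it sees.
public class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is checked because the JDK factory is not namespace-aware by default,
        // in which case localName is empty.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Parses the markup with the JDK's SAX parser and returns the collected links.
    // Assumption: swapping in another SAX-compliant XMLReader only changes the first line.
    public static List<String> collectLinks(String markup) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"one.html\">1</a>"
                + "<p><a href=\"two.html\">2</a></p></body></html>";
        System.out.println(collectLinks(page));
    }
}
```

The handler never refers to the concrete parser, which is the point of a SAX-compliant design: the same ContentHandler can be driven by any XMLReader implementation.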


Related Projects

Arabica

  •    C++

Arabica is an XML and HTML processing toolkit, providing SAX, DOM, XPath, and partial XSLT implementations, written in Standard C++.

TagSoup - HTML/XML parser for Haskell

  •    Haskell

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

Nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support

  •    Ruby

Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features are the ability to search documents via XPath or CSS3 selectors, an XML/HTML builder, and an XSLT transformer. Nokogiri parses and searches XML/HTML using native libraries (either C or Java, depending on your Ruby), which means it's fast and standards-compliant.

xml-stream - XML stream parser based on Expat. Made for Node.

  •    Javascript

XmlStream is a Node.js XML stream parser and editor, based on node-expat (libexpat SAX-like parser binding). When working with large XML files, it is probably a bad idea to use an XML to JavaScript object converter, or simply buffer the whole document in memory. Then again, a typical SAX parser might be too low-level for some tasks (and often a real pain).

Piccolo

  •    Java

Piccolo is a small, extremely fast XML parser for Java. It implements the SAX 1, SAX 2.0.1, and JAXP 1.1 (SAX parsing only) interfaces as a non-validating parser and attempts to detect all XML well-formedness errors. Piccolo was developed by Yuval Oren.


Apache Xerces for Perl XML Parser - Perl API to the Apache Xerces XML parser.

  •    Perl

Perl API to the Apache Xerces XML parser.

GXPARSE: XML stream parser API

  •    Java

A generic Java XML stream parser API that makes it much easier to use event-based stream parsers such as SAX. It includes an implementation for SAX parsers and also supports recursive pattern matching.

Apache Xerces for Java XML Parser

  •    Java

Xerces-J is a validating XML parser written in Java.

SAXExpat

  •    C#

This is a SAX for .NET parser implementation based on the popular Expat XML parser.

Piccolo XML Parser for Java

  •    Java

Piccolo is the fastest SAX parser for Java, supporting SAX1, SAX2, and JAXP (SAX only). Piccolo is different from other parsers in that it was developed using parser generators. It weighs 160K including XML APIs. See http://piccolo.sf.net for more info.

parse5 - HTML parsing/serialization toolset for Node

  •    Javascript

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant. parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular2, Polymer and many more.

posthtml - PostHTML is a tool to transform HTML/XML with JS plugins

  •    Javascript

PostHTML is a tool for transforming HTML/XML with JS plugins. PostHTML itself is very small. It includes only an HTML parser, an HTML node tree API and a node tree stringifier. All HTML transformations are made by plugins. These plugins are just small plain JS functions, which receive an HTML node tree, transform it, and return a modified tree.

htmlparser2 - forgiving html and xml parser

  •    Javascript

A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface. A live demo of htmlparser2 is available here.

Xerces-C++

  •    C++

Xerces-C++ is a validating XML parser written in a portable subset of C++. Xerces-C++ makes it easy to give your application the ability to read and write XML data.

Libxml++

  •    C++

libxml++ is a C++ wrapper for the libxml XML parser library.

parser-lib - Collection of parsers written in JavaScript

  •    Javascript

The ParserLib CSS parser is a CSS3 SAX-inspired parser written in JavaScript. It handles standard CSS syntax as well as validation (checking of property names and values), although it is not guaranteed to thoroughly validate all possible CSS properties. The CSS parser is built for a number of different JavaScript environments. The most recently released version of the parser can be found in the dist directory when you check out the repository; run "npm run build" to regenerate the files from the latest sources.

node-htmlparser - Forgiving HTML/XML/RSS Parser in JS for *both* Node and Browsers

  •    Javascript

A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.

oga - Moved to https://gitlab.com/yorickpeterse/oga

  •    Ruby

NOTE: my spare time is limited which means I am unable to dedicate a lot of time on Oga. If you're interested in contributing to FOSS, please take a look at the open issues and submit a pull request to address them where possible. Oga is an XML/HTML parser written in Ruby. It provides an easy to use API for parsing, modifying and querying documents (using XPath expressions). Oga does not require system libraries such as libxml, making it easier and faster to install on various platforms. To achieve better performance Oga uses a small, native extension (C for MRI/Rubinius, Java for JRuby).

Kanna - Kanna(鉋) is an XML/HTML parser for Swift.

  •    Swift

Kanna (鉋) is a cross-platform XML/HTML parser (macOS, iOS, tvOS, watchOS and Linux!). It was inspired by Nokogiri (鋸).

node-expat - libexpat XML SAX parser binding for node.js

  •    Javascript

We don't emit an error event because libexpat doesn't use a callback either. Instead, check that parse() returns true. A descriptive string can be obtained via getError() to provide user feedback. Alternatively, use the Parser like a node Stream. write() will emit error events.