TagSoup - HTML/XML parser for Haskell
TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping. The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information.
http://community.haskell.org/~ndm/tagsoup/
comments powered by Disqus
Related Products
Arbica
Arabica is an XML and HTML processing toolkit, providing SAX, DOM, XPath, and partial XSLT implementations, written in Standard C++.
Beautiful Soup - Python HTML/XML parser
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."
TagSoup - SAX-compliant parser in Java
TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Expat
Expat is an XML parser library written in C. It is a stream-oriented parser in which an application registers handlers for things the parser might find in the XML document (like start tags).
Libxml++
libxml++ is a C++ wrapper for the libxml XML parser library.
Hpricot - HTML parser for Ruby
Hpricot is a fast, flexible HTML parser. Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out.
Neko HTML Parser - simple HTML scanner
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and fix up many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements. Automatically closes elements with optional end tags and can handle mismatched inline element tags.
SAXExpat
This is a SAX for .NET parser implementation based on the popular Expat XML parser.
Piccolo
Piccolo is a small, extremely fast XML parser for Java. It implements the SAX 1, SAX 2.0.1, and JAXP 1.1 (SAX parsing only) interfaces as a non-validating parser and attempts to detect all XML well-formedness errors. Piccolo was developed by Yuval Oren.
Xerces-C++
Xerces-C++ is a validating XML parser written in a portable subset of C++. Xerces-C++ makes it easy to give your application the ability to read and write XML data.