TagSoup - HTML/XML parser for Haskell

  •        0

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping. The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information.




comments powered by Disqus

Related Projects

TagSoup - SAX-compliant parser in Java

TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.


Expat is an XML parser library written in C. It is a stream-oriented parser in which an application registers handlers for things the parser might find in the XML document (like start tags).


Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

HtmlCleaner - HTML parser in Java

HtmlCleaner is HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.

JsonFx.NET - JSON serialization framework for .NET

JsonFx v2.0 - JSON serialization framework for .NET. It has unified interface for reading / writing JSON, BSON, XML, JsonML. It implements LINQ-to-JSON, Supports reading/writing using DataContract, XmlSerialization, JsonName, attributes and lot more.

Hpricot - HTML parser for Ruby

Hpricot is a fast, flexible HTML parser. Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out.


Arabica is an XML and HTML processing toolkit, providing SAX, DOM, XPath, and partial XSLT implementations, written in Standard C++.


ROME is an set of Java tools for parsing, generating and publishing RSS and Atom feeds. The core ROME library depends only on the JDOM XML parser and supports parsing, generating and converting all of the popular RSS and Atom formats including RSS 0.90, RSS 0.91 Netscape, RSS 0.91 Userland, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, and Atom 1.0. You can parse to an RSS object model, an Atom object model or an abstract SyndFeed model that can model either family of formats.

Beautiful Soup - Python HTML/XML parser

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."

Tablesorter -Flexible client-side table sorting

Tablesorter is a jQuery plugin for turning a standard HTML table with THEAD and TBODY tags into a sortable table without page refreshes. tablesorter can successfully parse and sort many types of data including linked data in a cell.

Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.

Tag Cloud >>