Hpricot - HTML parser for Ruby

Hpricot is a fast, flexible HTML parser. Hpricot can be handy for reading broken XML files, since many of the same techniques can be used. If a quote is missing, Hpricot tries to figure it out. If tags overlap, Hpricot works on sorting them out. Source code location: http://github.com/hpricot/hpricot

  • Hpricot is a standalone library. It requires no other libraries. Just Ruby!
  • Hpricot works hard to sort out bad HTML and pays a small penalty in order to get that right.
  • If you can see it in Firefox, then Hpricot should parse it.
  • Primarily, Hpricot is used for reading HTML and tries to sort out troubled HTML by having some idea of what good HTML is.



http://hpricot.com/


Bookmark and Share          4069



comments powered by Disqus


Related Products

Neko HTML Parser - simple HTML scanner

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and fix up many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements. Automatically closes elements with optional end tags and can handle mismatched inline element tags.

Read more

JTidy - HTML parser and pretty printer in Java

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

Read more

TagSoup - SAX-compliant parser in Java

TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Read more

Arbica

Arabica is an XML and HTML processing toolkit, providing SAX, DOM, XPath, and partial XSLT implementations, written in Standard C++.

Read more

TagSoup - HTML/XML parser for Haskell

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

Read more

Beautiful Soup - Python HTML/XML parser

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."

Read more

Boilerpipe - Fulltext Extraction from HTML pages

The boilerpipe library provides algorithms to detect and remove the surplus clutter (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction).

Read more

Expat

Expat is an XML parser library written in C. It is a stream-oriented parser in which an application registers handlers for things the parser might find in the XML document (like start tags).

Read more

PEG.js - Parser Generator for JavaScript

PEG.js is a simple parser generator for JavaScript that produces fast parsers with excellent error reporting. You can use it to process complex data or computer languages and build transformers, interpreters, compilers and other tools easily. It integrates both lexical and syntactical analysis.

Read more

SAXExpat

This is a SAX for .NET parser implementation based on the popular Expat XML parser.

Read more

Follow feeds Follow bestopensource on Twitter Follow bestopensource on Facebook

Enter your email address:

Delivered by FeedBurner



Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.