htmlparser2 - forgiving html and xml parser

  •        51

A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface. A live demo of htmlparser2 is available here.

https://github.com/fb55/htmlparser2

Dependencies:

domelementtype : ^1.3.0
domhandler : ^2.3.0
domutils : ^1.5.1
entities : ^1.1.1
inherits : ^2.0.1
readable-stream : ^2.0.2

Tags
Implementation
License
Platform

   




Related Projects

node-htmlparser - Forgiving HTML/XML/RSS Parser in JS for *both* Node and Browsers

  •    Javascript

#NodeHtmlParser A forgiving HTML/XML/RSS parser written in JS for both the browser and NodeJS (yes, despite the name it works just fine in any modern browser). The parser can handle streams (chunked data) and supports custom handlers for writing custom DOMs/output.

Nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support

  •    Ruby

Nokogiri (?) is an HTML, XML, SAX, DOM parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors, XML/HTML builder, XSLT transformer. Nokogiri parses and searches XML/HTML using native libraries (either C or Java, depending on your Ruby), which means it's fast and standards-compliant.

RSParser - Parser for RSS, Atom, JSON Feed, RSS-inJSON, OPML, and HTML.

  •    HTML

This framework was developed for NetNewsWire and is made available here for developers who just need the parsing code. It has no depencies that aren’t provided by the system. It also includes Objective-C wrappers for libXML2’s XML SAX and HTML SAX parsers. You can write your own parsers on top of these.

gofeed - Parse RSS and Atom feeds in Go

  •    Go

The gofeed library is a robust feed parser that supports parsing both RSS and Atom feeds. The universal gofeed.Parser will parse and convert all feed types into a hybrid gofeed.Feed model. You also have the option of parsing them into their respective atom.Feed and rss.Feed models using the feed specific atom.Parser or rss.Parser.It also provides support for parsing several popular predefined extension modules, including Dublin Core and Apple’s iTunes, as well as arbitrary extensions. See the Extensions section for more details.


AngleSharp - The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications

  •    CSharp

AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The included parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code and ensures compatibility with results in evergreen browsers. Also standard DOM features such as querySelector or querySelectorAll work for tree traversal.

ROME

  •    Java

ROME is an set of Java tools for parsing, generating and publishing RSS and Atom feeds. The core ROME library depends only on the JDOM XML parser and supports parsing, generating and converting all of the popular RSS and Atom formats including RSS 0.90, RSS 0.91 Netscape, RSS 0.91 Userland, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, and Atom 1.0. You can parse to an RSS object model, an Atom object model or an abstract SyndFeed model that can model either family of formats.

Arbica

  •    C++

Arabica is an XML and HTML processing toolkit, providing SAX, DOM, XPath, and partial XSLT implementations, written in Standard C++.

feedhq - FeedHQ is a web-based feed reader

  •    Python

Then deploy the Django app using the recipe that fits your installation. More documentation on the Django deployment guide. The WSGI application is located at feedhq.wsgi.application.

html-agility-pack - Html Agility Pack (HAP)

  •    CSharp

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).Have a question? Ask questions and find answers from over 2500 questions.

MagpieRSS - XML-based RSS parser in PHP

  •    PHP

MagpieRSS is compatible with RSS 0.9 through RSS 1.0. Also parses RSS 1.0's modules, RSS 2.0, and Atom. (with a few exceptions)

FeedParser - Parse RSS and Atom feeds in Python

  •    Python

Universal Feed Parser is a Python module for parsing syndicated feeds. It can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, Atom 1.0, and CDF feeds. It also parses several popular extension modules, including Dublin Core and Apple's iTunes extensions.

TagSoup - HTML/XML parser for Haskell

  •    Haskell

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

php-simple-html-dom-parser - PHP Simple HTML DOM Parser adaptation for Composer and PSR-0

  •    HTML

A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way! Require PHP 5+. Supports invalid HTML. Find tags on an HTML page with selectors just like jQuery. Extract contents from HTML in a single line.

posthtml - PostHTML is a tool to transform HTML/XML with JS plugins

  •    Javascript

PostHTML is a tool for transforming HTML/XML with JS plugins. PostHTML itself is very small. It includes only a HTML parser, a HTML node tree API and a node tree stringifier. All HTML transformations are made by plugins. And these plugins are just small plain JS functions, which receive a HTML node tree, transform it, and return a modified tree.

Cowzilla The Feed Monster

  •    PHP

Cowzilla is Atom 0.3 parser written as a PHP class. It has the functionality to parse and convert Atom to RSS 2.0 (atom2rss), Atom to a HTML document, and check your GMail.

JTidy - HTML parser and pretty printer in Java

  •    Java

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

Html Agility Pack

  •    

This is an HTML parser that builds a read/write DOM from “real world” HTML files. It supports XPATH or XSLT and is tolerant with "real world" malformed HTML.

RSS Parser and XML Parser for PHP 5+

  •    PHP

A full XML Parser for PHP with RSS Parser specific functionsl; think of it as an interface to the PHP DOM which allows easy access to your XML based documents. Auto encoding conversion to UTF-8 + Array to XML Conversion. V3 is now a commercial product

feedparser - A Cocoa RSS/Atom parser for Mac OS X and the iPhone

  •    Objective-C

FeedParser is an NSXMLParser-based RSS/Atom feed parser for Cocoa. It is intended to parse well-formed RSS and Atom feeds on both the desktop and the iPhone. The simplest way to use FeedParser is to simply add the FeedParser directory to your project. FeedParser also includes a static library target if you prefer to include it that way.





We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.