TagSoup - SAX-compliant parser in Java

TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
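
Because TagSoup presents itself through the standard SAX interfaces, application code can be written against org.xml.sax.XMLReader and stay independent of the parser behind it. The following is a minimal sketch of that idea, not TagSoup's own sample code: the class SaxLinkCollector and the method collectLinks are hypothetical names made up for this illustration, and the example runs with the JDK's built-in XML parser on well-formed input. With the TagSoup jar on the classpath, an instance of org.ccil.cowan.tagsoup.Parser (which implements XMLReader) could presumably be passed in instead to handle messy HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration; not part of TagSoup.
public class SaxLinkCollector {

    /** Records the href attribute of every <a> element the reader reports. */
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // Match on qName so the handler works whether or not the
            // reader is namespace-aware.
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses the markup with the given SAX reader and returns the links found. */
    static List<String> collectLinks(XMLReader reader, String markup) throws Exception {
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // JDK default reader; assumption: with TagSoup available this line
        // could become: XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String markup = "<html><body><a href=\"http://example.org/\">x</a></body></html>";
        System.out.println(collectLinks(reader, markup));
    }
}
```

The handler only depends on SAX events, so swapping the reader is the single change needed to move from strict XML input to HTML as found in the wild.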



http://home.ccil.org/~cowan/XML/tagsoup/



Related Products

Jssaxparser - A SAX 2 parser written in Javascript

JavaScript SAX 2 Parser: a lightweight JavaScript SAX 2 parser which reads an XML text and triggers standardized SAX 2 events. Introduction: the parser is able to read XML and its associated DTD. It fires the events of contentHandler, errorHandler, dtdHandler, entityResolver, declarationHandler and lexicalHandler, conforming to the specification at http://www.saxproject.org/ . How to use it: import the library: <script type="text/javascript" src="../jssaxparser/sax.js"></script><script type="text/javascript" sr


Arabica

Arabica is an XML and HTML processing toolkit, providing SAX, DOM, XPath, and partial XSLT implementations, written in Standard C++.


Xmlcc - A platform independent object-oriented C++ library for generating, writing and parsing XML a

XMLCC is a C++ library for handling XML using design patterns, especially the Composite pattern. About: XMLCC allows for generating XML structures using a hierarchical object-oriented model that can be written to an XML file easily. Parsing is available through several parsers: a DOM-like parser that builds the complete object-oriented model, which can be searched for XML tags afterwards, or a SAX-like parser that can be specialized to an XML structure by implementing an API. Both parsers are char by ch


Seedparser - SAX Based Extended Event Driven (SEED) XML Parser

SAX based Extended Event Driven XML Parser for C++. Introduction: building a DOM tree is the simplest way to parse an XML document and extract details. It is widely used in a variety of applications spanning business domains: simple desktop applications, enterprise applications and large-scale web applications. In a large-scale application or in a B2B scenario the XML document tends to be quite huge, on the order of 100s of GBs, in which case building a DOM tree would simply not be possible because of memor


Xpath4sax - XPath for SAX XML Parser

A quick XPath analyser built on a SAX parser. Some syntaxes are invalid, but all commonly used syntax is present. It is possible to catch many XPaths at the same time. XPathXMLHandler handler=new XPathXMLHandler() { @Override public void findXPathNode(SAXXPath xpath, Object node) { System.out.println("node="+node); } };handler.setXPaths(XPathXMLHandler.toXPaths("//b[@at_a='s3']/c"));SAXParser parser = SAXParserFactory.newInstance().newSAXParser();parser.parse(new InputSource(new StringReader(xml)), handler


TagSoup - HTML/XML parser for Haskell

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.


Core-xml - corexml for students

This is a website that allows students to learn and discuss XML! At the end of this course, students will be able to: Outline the features of markup languages and list their drawbacks. Define and describe XML. State the benefits and scope of XML. Describe the structure of an XML document. Explain the lifecycle of an XML document. State the functions of editors for XML and list the popularly used editors. State the functions of pars


Wikixmlj - A Java API to parse Wikipedia XML dumps

(Part of the larger WikiSense project aimed at understanding Wikipedia for semantic annotation of texts) WikiXMLJ provides easy access to Wikipedia XML dumps. Latest (r43): What's new? Speedup in SAX parsing ( issue #9 ) Features: Easy access to important elements of a Wikipedia page Also provides interfaces for Wiki text parsing. Memory efficient SAX interface for parsing Lazy loading of files for DOM Callback support with DOM Directly operate on compressed wikipedia dumps (gzip/bzip2/native xm


Piccolo

Piccolo is a small, extremely fast XML parser for Java. It implements the SAX 1, SAX 2.0.1, and JAXP 1.1 (SAX parsing only) interfaces as a non-validating parser and attempts to detect all XML well-formedness errors. Piccolo was developed by Yuval Oren.


Easyxmldata - An extension to the SAX Parser, which eases the creation of classes that represent XML

easyxmldata is a small package that uses the SAX Parser to make parsing a little bit easier. The package offers an Interface, Utilities and a Parser which allow an easy XML Parser implementation for custom XML elements and corresponding Java classes. You want to parse an XML document and create corresponding classes that will hold the parsed XML elements. This package allows you to create your own classes that will represent the XML elements. For each possible tag that can be found in the X

