Boilerpipe - Fulltext Extraction from HTML pages

  •        0

The boilerpipe library provides algorithms to detect and remove the surplus clutter (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction).




comments powered by Disqus

Related Projects

TagSoup - SAX-compliant parser in Java

TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.


Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Tablesorter -Flexible client-side table sorting

Tablesorter is a jQuery plugin for turning a standard HTML table with THEAD and TBODY tags into a sortable table without page refreshes. tablesorter can successfully parse and sort many types of data including linked data in a cell.

PDFBox - Java PDF library

Apache PDFBox is an open source Java PDF library for working with PDF documents. This library allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. It provides support for adding bookmarks, fonts, text extraction, Encryption, PDF printing and lot more.

JTidy - HTML parser and pretty printer in Java

JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

ANTLR - ANother Tool for Language Recognition

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It's widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day.

Html Agility Pack

This is an HTML parser that builds a read/write DOM from “real world” HTML files. It supports XPATH or XSLT and is tolerant with "real world" malformed HTML.

iText - Java PDF library

iText is one of the popular and widely used PDF library. It is used to generate PDF documents dynamically. Mostly web developers will love it to generate PDF documents and reports based on data from an XML file or a database and serves it to the browser. It has support of adding bookmarks, watermarks, Encryption, Form filling and lot more.

HtmlCleaner - HTML parser in Java

HtmlCleaner is HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.

TagSoup - HTML/XML parser for Haskell

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.