parse5 - HTML parsing/serialization toolset for Node

  •    Javascript

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant. parse5 provides nearly everything you may need when dealing with HTML. It's the fastest spec-compliant HTML parser for Node to date. It parses HTML the way the latest version of your browser does. It has proven itself reliable in such projects as jsdom, Angular2, Polymer and many more.

goquery - A little like that j-thing, only in Go.

  •    Go

goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go's net/html package and the CSS Selector library cascadia. Since the net/html parser returns nodes, and not a full-featured DOM tree, jQuery's stateful manipulation functions (like height(), css(), detach()) have been left off. Also, because the net/html parser requires UTF-8 encoding, so does goquery: it is the caller's responsibility to ensure that the source document provides UTF-8 encoded HTML. See the wiki for various options to do this.
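
The UTF-8 requirement above means the caller has to normalize the input before parsing. A minimal sketch of that step, written in Python rather than Go and not using goquery or any of the options on its wiki: the helper name to_utf8 and the fallback order (UTF-8, then a declared charset, then Latin-1) are assumptions made for illustration.

```python
from typing import Optional


def to_utf8(raw: bytes, declared_charset: Optional[str] = None) -> bytes:
    """Return `raw` re-encoded as UTF-8 (hypothetical helper).

    Tries UTF-8 first, then the declared charset if one is given, and
    finally Latin-1, which can decode any byte sequence.
    """
    candidates = ["utf-8"]
    if declared_charset:
        candidates.append(declared_charset)
    candidates.append("latin-1")

    for name in candidates:
        try:
            text = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know: try the next one.
            continue
        return text.encode("utf-8")

    # Latin-1 maps every byte to a code point, so the loop always returns.
    raise AssertionError("unreachable")
```

The Latin-1 fallback never fails, but it can produce the wrong characters when the real encoding is something else, so a declared charset (for example from the HTTP Content-Type header) should be passed whenever it is known.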

Fuzi - A fast & lightweight XML & HTML parser in Swift with XPath & CSS support

  •    Swift

Fuzi is based on a Swift port of Mattt Thompson's Ono(斧), using most of its low-level implementations with moderate class & interface redesign following standard Swift conventions, along with several bug fixes. Fuzi(斧子) means "axe", in homage to Ono(斧), which in turn is inspired by Nokogiri (鋸), which means "saw".

scala-scraper - A Scala library for scraping content from HTML pages

  •    Scala

A library providing a DSL for loading and extracting content from HTML pages. Take a look at Examples.scala and at the unit specs for usage examples or keep reading for more thorough documentation. Feel free to use GitHub Issues for submitting any bug or feature request and Gitter to ask questions.




RichSnippetExtractor (By NowFloats)

  •    DotNet

A library for extracting rich snippets from HTML source documents or directly from a URL.

jusText - Heuristic based boilerplate removal tool

  •    Python

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as Web corpora. You can try it online. This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to the PhD research of Jan Pomikálek.
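
To make the idea of heuristic boilerplate removal concrete, here is a much-simplified sketch using only Python's standard library. It is not jusText's algorithm or API: the block tags, the tiny stopword list, the thresholds, and the names BlockCollector, keep_block and extract_main_text are all assumptions chosen for illustration. It keeps a text block only if it is long enough, is not mostly link text, and contains a reasonable share of common function words.

```python
from html.parser import HTMLParser

# Illustrative values only; a real tool would use per-language stopword lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "was"}
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3",
              "nav", "header", "footer"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count characters inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data)

    def close(self):
        super().close()
        self._flush()


def keep_block(text, link_chars, min_words=8,
               max_link_density=0.3, min_stopword_ratio=0.2):
    """Heuristic: long enough, few links, enough function words."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    if link_chars / len(text) > max_link_density:
        return False
    stop = sum(1 for w in words if w.strip(".,;:!?") in STOPWORDS)
    return stop / len(words) >= min_stopword_ratio


def extract_main_text(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [text for text, links in parser.blocks if keep_block(text, links)]
```

With these thresholds, short navigation and footer blocks are dropped because they contain few words, while a paragraph of ordinary prose passes all three checks.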

XML-Parser - A Node.js XML DOM, Parser & Stringifier.

  •    Javascript

Parse XML, HTML and more with a very tolerant XML parser and convert it into a DOM. The three components (DOM, parser, and stringifier) are available as separate modules.


breadability - Reworked https://www

  •    HTML

I've tried to work with the various forks of some ancient codebase that ported readability to Python. The lack of tests, unused regexes, and commented-out sections of code in other Python ports just drove me nuts. I put forth an effort to bring several of the better forks into one code base, but they've diverged so much that I just can't work with it.

jfiveparse - A Java HTML5-compliant parser

  •    Java

jfiveparse passes all the non-scripted tests for the tokenizer and tree construction from the html5lib-tests suite. It provides both fragment and full document parsing. It can parse directly from a String or by streaming through a Reader (note: the encoding must be known; the parser does not currently implement an autodetect feature).
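
The streaming case with a known encoding can be sketched as follows. This is Python, not Java, and it does not use jfiveparse: it relies on the standard library's html.parser, which is not an HTML5-compliant parser, and the names HeadingCollector and read_heading are hypothetical. The point is only the pattern: the caller supplies the encoding, bytes are decoded incrementally, and the parser is fed chunk by chunk.

```python
import codecs
import io
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Text may arrive in several pieces when the input is fed in chunks.
        if self._in_h1:
            self._parts.append(data)

    @property
    def heading(self):
        return "".join(self._parts)


def read_heading(stream, encoding, chunk_size=4096):
    """Parse a binary stream whose encoding the caller already knows."""
    # The incremental decoder buffers multi-byte characters that are
    # split across two chunks, so no character is decoded in halves.
    decoder = codecs.getincrementaldecoder(encoding)()
    parser = HeadingCollector()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(decoder.decode(chunk))
    parser.feed(decoder.decode(b"", final=True))
    parser.close()
    return parser.heading


if __name__ == "__main__":
    doc = "<html><body><h1>Café menu</h1></body></html>".encode("utf-8")
    print(read_heading(io.BytesIO(doc), "utf-8", chunk_size=8))
```

Passing the wrong encoding here either raises a decoding error or silently produces the wrong characters, which is why a parser without autodetection needs the caller to know the encoding up front.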




