breadability - Reworked https://www

I've tried to work with the various forks of an ancient codebase that ported readability to Python. The lack of tests, the unused regexes, and the commented-out sections of code in other Python ports just drove me nuts. I made an effort to bring several of the better forks into one codebase, but they've diverged so much that I just can't work with them.

https://bookieio.github.io/breadability/
https://github.com/bookieio/breadability

Related Projects

sumy - Module for automatic summarization of text documents and HTML pages.

  •    Python

Sumy includes a command-line utility for quick summarization of documents. You can also use sumy as a library in your project. Create file sumy_example.py (don't name it sumy.py) with the code below to test it.

textract - node

  •    HTML

A text extraction node module. In almost all cases above, what textract cares about is the mime type. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
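
The point about extensions sharing a mime type can be illustrated with a short Python sketch. This is not textract's API (textract is a Node module); it only uses Python's standard `mimetypes` table, and the function name `group_by_mime_type` and the file names are made up for the example.

```python
import mimetypes
from collections import defaultdict


def group_by_mime_type(filenames):
    """Group file names by the mime type guessed from their extension."""
    # A fresh MimeTypes instance uses Python's built-in extension table.
    db = mimetypes.MimeTypes()
    groups = defaultdict(list)
    for name in filenames:
        mime_type, _encoding = db.guess_type(name)
        groups[mime_type].append(name)
    return dict(groups)


if __name__ == "__main__":
    # .html and .htm land in the same group because they share a mime type.
    for mime_type, names in group_by_mime_type(
        ["index.html", "legacy.htm", "sheet.xls"]
    ).items():
        print(mime_type, names)
```

An extractor that dispatches on the mime type rather than on the extension then needs only one handler per group.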

jusText - Heuristic based boilerplate removal tool

  •    Python

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences and is therefore well suited for creating linguistic resources such as Web corpora. You can try it online. This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to the PhD research of Jan Pomikálek.

Pandoc - General Markup Converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in markdown, reStructuredText, textile, HTML, DocBook, or LaTeX to HTML formats, word processor formats, PDF, and other markup formats.

rake-nltk - Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.

  •    Python

RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text. If you see a stopwords error, it means that you do not have the stopwords corpus downloaded from NLTK. You can download it using the command below.


Pandoc - Universal markup converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.

tabula - Tabula is a tool for liberating data tables trapped inside PDF files

  •    CSS

Repo Note: The master branch is an in-development version of Tabula. This may be substantially different from the latest releases of Tabula. As of August 2015, the master branch (and Tabula 1.1.X+) uses tabula-java instead of tabula-extractor under the hood. Previous versions of Tabula use tabula-extractor.

RAKE - A python implementation of the Rapid Automatic Keyword Extraction

  •    Python

A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons. The source code is released under the MIT License.
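
As a rough illustration of the idea behind RAKE (scoring phrases by word frequency and co-occurrence), here is a minimal standard-library sketch. It is not the code of this project or of rake-nltk: the tiny `STOPWORDS` set and the function names are assumptions made for the example, and the word score used is the common degree/frequency ratio.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real tools use a full list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "with", "on", "by"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation ends a phrase (this simple split also breaks hyphenated words).
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word appears
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = "Keyword extraction is the task of finding key phrases in a body of text."
    for phrase, score in rake_scores(sample):
        print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that rarely appear alone score highest, which is why multi-word phrases tend to rank above single words.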

Gate - General Architecture for Text Engineering

  •    Java

GATE excels at text analysis of all shapes and sizes. It provides support for diverse language processing tasks such as parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. It provides support to measure, evaluate, model, and persist the data structure. It can analyze text or speech. It has built-in support for machine learning and also supports other machine learning implementations via plugins.

docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files

  •    Java

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

iText - Java PDF library

  •    Java

iText is a popular and widely used PDF library. It is used to generate PDF documents dynamically. Web developers in particular use it to generate PDF documents and reports from an XML file or a database and serve them to the browser. It supports adding bookmarks, watermarks, encryption, form filling, and a lot more.

Arachnode.net

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

maildown - Write your ActionMailer email templates in Markdown, send in html and plain text

  •    Ruby

So why not write your templates once in Markdown, and have them translated to text and HTML? With Maildown, now you can. In your app/views/<mailer> directory, create a file with a .md.erb extension. When Rails renders the email, it will generate HTML by parsing the Markdown, and generate plain text by sending the email as is.

node-html-to-text - Advanced html to text converter

  •    Javascript

An advanced converter that parses HTML and returns beautiful text. It was mainly designed to transform HTML E-Mail templates to a text representation. So it is currently optimized for table layouts. By using the format option, you can specify formatting for these elements: text, image, lineBreak, paragraph, anchor, heading, table, orderedList, unorderedList, listItem, horizontalLine.

hget - :clap: Render websites in plain text from your terminal

  •    HTML

A CLI and an API to convert HTML into plain text. Can be used to fetch a site's HTML version and convert it into plain text, or to deliver plain text versions of your site dynamically. You can also convert HTML into HTML, ignoring certain document elements, and starting at a root element other than <html>. You can choose to take raw Markdown output as well, instead of the default terminal-formatted plain text.

Simple Text Processing Library

  •    C++

A simple text-processing library that aims to assist in parsing all kinds of text, including plain text, XML, and HTML, which means it can be used as a simple XML parser or an HTML parser.

SnappySnippet - Chrome extension that allows easy extraction of CSS and HTML from selected element.

  •    CSS

Chrome/Chromium extension that allows easy CSS+HTML extraction of a specific DOM element. The created snippet can then be exported to CodePen, jsFiddle, or JS Bin with one click. Or download it and manually load it as an 'Unpacked extension' via the Chrome extensions page.

Content-Based Cross-Site Web Data Mining

The Content-Based Cross-Site Mining (CCM) of Web Data Records algorithm combines techniques for extracting data records based on the structure of documents (HTML tags) with an analysis of the semantics of the content for better data record extraction.

Workflow HTML / Versioning HTML Module for DotNetNuke - by Effority.Net

Based on the "Text/HTML Core Module" the Text/HTML Workflow Module offers simple versioning and approval abilities for your Text/HTML Module content.