sumy - Module for automatic summarization of text documents and HTML pages.

  •    Python

Sumy contains a command-line utility for quick summarization of documents. Or you can use sumy as a library in your project: create a small script (don't name it sumy.py, or it will shadow the package) with a few lines of code to test it.



Related Projects

textract - node

  •    HTML

A text extraction node module. In almost all cases above, what textract cares about is the mime type, not the extension. So .html and .htm, both possessing the same mime type, will be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.
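The extension-to-mime mapping this relies on can be illustrated with Python's standard mimetypes module (a stand-in for illustration; textract itself is a Node module):

```python
import mimetypes

# .html and .htm map to the same mime type, which is why a mime-keyed
# extractor like textract treats them identically.
html_type = mimetypes.guess_type("page.html")[0]
htm_type = mimetypes.guess_type("page.htm")[0]
print(html_type, htm_type)  # text/html text/html
```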

Pandoc - General Markup Converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert documents in Markdown, reStructuredText, Textile, HTML, DocBook, or LaTeX to HTML formats, word-processor formats, PDF, and other markup formats.

SnappySnippet - Chrome extension that allows easy extraction of CSS and HTML from selected element.

  •    CSS

Chrome/Chromium extension that allows easy CSS+HTML extraction of a specific DOM element. The created snippet can then be exported to CodePen, jsFiddle or JS Bin with one click. Alternatively, you can download the extension and load it manually as an 'Unpacked extension' via the Chrome extensions page.

rake-nltk - Python implementation of the Rapid Automatic Keyword Extraction algorithm using NLTK.

  •    Python

RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearances and their co-occurrence with other words in the text. If you see a stopwords error, it means that you do not have the NLTK stopwords corpus downloaded; you can download it with `nltk.download('stopwords')`.
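The frequency/co-occurrence scoring idea can be sketched in plain Python (a minimal illustration, not the rake-nltk implementation; the tiny stopword list stands in for NLTK's full corpus):

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; rake-nltk uses NLTK's full corpus.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "with"}

def rake(text):
    """Minimal RAKE sketch: split text into candidate phrases at stopwords
    and punctuation, then score phrases by summed degree/frequency ratios."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z']+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    # freq(w) = occurrences of w; degree(w) = total length of phrases with w.
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # Phrase score = sum of per-word scores degree(w) / freq(w).
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: -kv[1])

print(rake("deep learning models for keyword extraction and keyword ranking"))
```

Longer phrases score higher because each member word inherits the phrase length through its degree, which is why RAKE favors multi-word key phrases.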

jusText - Heuristic based boilerplate removal tool

  •    Python

Program jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, and it is therefore well suited for creating linguistic resources such as Web corpora. You can try it online. This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to the PhD research of Jan Pomikálek.

Pandoc - Universal markup converter

  •    Haskell

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. Pandoc can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, MultiMarkdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, TikiWiki markup, Creole 1.0, Haddock markup, OPML, Emacs Org mode, DocBook, JATS, Muse, txt2tags, Vimwiki, EPUB, ODT, and Word docx.


Analyzator - HTML page analyser and parser

  •    Delphi

Analyzator is a HTML page analyser and parser intended to extract content from a HTML page and store it into a XML file. The extraction process is controlled by a simple script language designed for this purpose.

TCPDF - PHP class for generating PDF

  •    PHP

TCPDF is a PHP class for generating PDF documents without requiring external extensions. TCPDF supports UTF-8, Unicode, RTL languages, XHTML, JavaScript, digital signatures, barcodes, and much more.

tabula - Tabula is a tool for liberating data tables trapped inside PDF files

  •    CSS

Repo Note: The master branch is an in-development version of Tabula and may be substantially different from the latest releases of Tabula. As of August 2015, the master branch (and Tabula 1.1.X+) uses tabula-java instead of tabula-extractor under the hood. Previous versions of Tabula use tabula-extractor.


HTML-TableExtract - Perl module for extracting information from tables within HTML documents

  •    Perl

HTML-TableExtract is a perl module that allows for easy and flexible extraction of information contained in tables within HTML documents. It is a subclass of HTML::Parser.

tabula-extractor - Extract tables from PDF files

  •    Ruby

Deprecation Note: This is the old version of the Tabula extraction engine. New projects wishing to integrate Tabula should use tabula-java (the new Java version of this extraction engine) unless you prefer to use JRuby. Users looking for the command-line version of Tabula should also use tabula-java. tabula-extractor is the table extraction engine that used to power Tabula.

HTML Parser

  •    Java

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.

docx4j - JAXB-based Java library for Word docx, Powerpoint pptx, and Excel xlsx files

  •    Java

docx4j is a library which helps you to work with the Office OpenXML file format as used in docx documents, pptx presentations, and xlsx spreadsheets.

iText - Java PDF library

  •    Java

iText is a popular and widely used PDF library. It is used to generate PDF documents dynamically. Web developers can use it to generate PDF documents and reports based on data from an XML file or a database and serve them to the browser. It supports adding bookmarks, watermarks, encryption, form filling, and a lot more.

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. It is a complete and comprehensive .NET web crawler for downloading, indexing, and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

.net HTTP Data Extractor

  •    LINQ

The nWeb Data Extractor Library provides support for extracting data from HTTP response HTML. It allows the user to convert the response HTML to XML and then extract the desired data from the generated XML file.
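The library itself targets .NET/LINQ; the parse-the-response-HTML-and-pull-out-data pattern it describes can be sketched with Python's standard html.parser (the class and sample markup here are illustrative):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag found in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

extractor = LinkExtractor()
extractor.feed('<p><a href="/docs">Docs</a> and <a href="/api">API</a></p>')
print(extractor.links)  # ['/docs', '/api']
```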

RAKE - A Python implementation of the Rapid Automatic Keyword Extraction algorithm

  •    Python

A Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications. John Wiley & Sons. The source code is released under the MIT License.

html5-test-page - A page filled with common HTML elements to be used for testing purposes

  •    HTML

This is a test page filled with common HTML elements to be used to provide visual feedback whilst building CSS systems and frameworks. I have not been able to find a dead-simple, standalone test page out there, so I made one. The HTML elements are loosely categorized as either text, embedded content, or form elements.

WebCrawler and entity extraction using the Fetch and Process framework


A web crawler built on the Fetch and Process framework. Yes, it does process robots.txt.
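Honoring robots.txt, as this crawler does, can be sketched with Python's standard urllib.robotparser (the rules and URLs below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly instead of fetching one over HTTP.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
rp = RobotFileParser()
rp.parse(rules)

# A polite crawler checks can_fetch() before requesting each URL.
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```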