graby - Graby extracts article content from web pages. This is a fork of Full-Text RSS v3.3.

Graby helps you extract article content from web pages. It was forked from Full-Text RSS because that code base is hard to read and understand internally and has no tests at all.

https://github.com/j0k3r/graby

Related Projects

dragnet - Just the facts -- web page content extraction

  •    Python

Dragnet isn't interested in the shiny chrome or boilerplate dressing of a web page. It's interested in... 'just the facts.' The machine learning models in Dragnet extract the main article content and, optionally, user-generated comments from a web page. They provide state-of-the-art performance on a variety of test benchmarks. This project was originally inspired by Kohlschütter et al., "Boilerplate Detection using Shallow Text Features", and Weninger et al., "CETR -- Content Extraction with Tag Ratios", and more recently by Readability.

OpenPipe - Document Pipeline

  •    Java

OpenPipe is an open source, scalable platform for manipulating a stream of documents. A pipeline is an ordered set of steps (operations) performed on a document to convert it from its raw form into something ready to be put into the index. The operations performed on documents include language detection, field manipulation, POS tagging, entity extraction and submitting the document to a search engine.
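
The idea of a pipeline as an ordered set of steps can be sketched in a few lines. This is an illustration only, written in Python rather than Java, and it does not use OpenPipe's API: the `Document` type, the two step functions, and the in-memory `index` list are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

# Hypothetical types: a document is a flat dict of string fields,
# and a step is any function that maps one document to another.
Document = Dict[str, str]
Step = Callable[[Document], Document]

def normalize_title(doc: Document) -> Document:
    # Field manipulation: trim and collapse whitespace in the title field.
    return {**doc, "title": " ".join(doc.get("title", "").split())}

def tag_language(doc: Document) -> Document:
    # Toy stand-in for language detection: ASCII-only text is labelled "en".
    lang = "en" if doc.get("body", "").isascii() else "unknown"
    return {**doc, "lang": lang}

def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    # Apply each step in order; the output of one step feeds the next.
    for step in steps:
        doc = step(doc)
    return doc

index: List[Document] = []  # stands in for a search engine index
raw = {"title": "  Hello   world ", "body": "Plain text body"}
index.append(run_pipeline(raw, [normalize_title, tag_language]))
```

Because every step has the same signature, steps can be reordered or swapped without touching `run_pipeline`.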

node-read - Get Readable Content from any page

  •    Javascript

Get Clean Reading Content from every web page

Pligg - Social Publishing CMS

  •    PHP

Pligg is an open source CMS (Content Management System) which provides social publishing software that encourages visitors to register on your website so that they can submit content and connect with other users. Our software creates websites where stories are created and voted on by members, not website editors. It is a user-driven CMS that relies on independent authors' content and participation to manage news articles.


ruby-readability - Port of arc90's readability project to Ruby

  •    Ruby

Ruby Readability is a tool for extracting the primary readable content of a webpage. It is a Ruby port of arc90's readability project. Readability comes with a command-line tool for experimentation in bin/readability.

Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes and mail servers. It supports various file formats such as Office, PDF, Zip and many more. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

Rainbow - portal development made easy

  •    ASPNET

Rainbow CMS, available today in 29 languages, allows content authoring to be safely delegated to role-based team members who need little or no knowledge of HTML. Rainbow optionally supports a two-step approval-publish process. 75 plug-in modules are now included in the standard release. It is also fairly easy to build your own custom modules.

Plone

  •    Python

Plone lets non-technical people create and maintain information using only a web browser. Perfect for web sites or intranets, Plone offers superior security without sacrificing extensibility or ease of use.

sitecheck

  •    Python

Modular web site spider for web developers.

TextWrapper

An IIS managed module that enables word wrap of plain text content. Supports GZip and Deflate encoding. This module increases readability of text files that contain long lines.

DotNetNuke

  •    ASPNET

DotNetNuke is the most widely adopted web content management system (WCM or CMS) and application development platform for building web sites and web applications on Microsoft .NET.

Pebble - Java EE blogging tool

  •    Java

Pebble is a lightweight, Java EE blogging tool. It's small, fast and feature-rich with unrivalled ease of installation and use. Blog content is stored as XML files on disk and served up dynamically, so there's no need to install a database. All maintenance and administration can be performed through your web browser, making Pebble ideal for anybody who is constantly on the move or doesn't have direct access to their host.

ocPortal - Advanced CMS with many features

  •    PHP

ocPortal CMS has all the features you would expect from a website engine, for instance photo galleries, news, file downloads and community forums/chats, and it provides them whilst meeting the highest accessibility and professional standards. It is also smart enough to go beyond page management, automatically handling search engine optimisation and providing aggressive hack attack prevention.

pangu.js - Why can't you all just add a space?

  •    Javascript

Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width characters (alphabetical letters, numerical digits and symbols).
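
The spacing rule described above can be approximated with two regular expressions. This is a minimal sketch, not pangu.js itself: it is written in Python, the function name `add_spacing` is hypothetical, and it assumes a reduced character set (kana, the main CJK ideograph block, and Hangul syllables on one side; ASCII letters and digits on the other). Symbols and the other ranges the real library handles are left out.

```python
import re

# Assumption: a reduced notion of "CJK" (kana, main ideograph block,
# Hangul syllables) and of "half-width" (ASCII letters and digits only).
CJK = r"\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af"
HALF = r"A-Za-z0-9"

_cjk_then_half = re.compile(f"([{CJK}])([{HALF}])")
_half_then_cjk = re.compile(f"([{HALF}])([{CJK}])")

def add_spacing(text: str) -> str:
    """Insert one space at each boundary between CJK and ASCII alphanumerics."""
    # Two passes, one per direction, so that adjacent boundaries are all caught.
    text = _cjk_then_half.sub(r"\1 \2", text)
    return _half_then_cjk.sub(r"\1 \2", text)
```

Text that is already spaced passes through unchanged, so the function can safely be applied more than once.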

extract-text-webpack-plugin - Extracts text from bundle into a file

  •    Javascript

Extract text from a bundle, or bundles, into a separate file. ⚠️ Since webpack v4, the extract-text-webpack-plugin should not be used for CSS. Use mini-css-extract-plugin instead.

Nexus - Repository Manager

  •    Java

Nexus manages software artifacts required for development, deployment, and provisioning. Nexus can share those artifacts with other developers and end-users. It is integrated with Eclipse. It stores content in the file system and does not require a database. Full-text search is supported by indexing the repository content.

Typo3

  •    PHP

TYPO3 is an enterprise class Web CMS written in PHP/MySQL. It's designed to be extended with custom written backend modules and frontend libraries for special functionality. It has very powerful integration of image manipulation.

PDFBox - Java PDF library

  •    Java

Apache PDFBox is an open source Java PDF library for working with PDF documents. This library allows creation of new PDF documents, manipulation of existing documents and extraction of content from documents. It provides support for bookmarks, fonts, text extraction, encryption, PDF printing and a lot more.

yuulia content manager

  •    VBScript

Content Manager for building a completely maintainable website in less than 1 hour. Forget the overblown and too complicated Content Management Systems out there. yuulia helps you to create websites out-of-the-box. Supports RSS, CSS, FCKEditor, etc.