WebHarvest - web data extraction tool


Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text-processing technologies to easily extract useful data from arbitrary web pages.




Related Projects

Web Curator Tool

The Web Curator Tool is a tool for managing the selective web harvesting process. It is designed for use in libraries by non-technical users, while allowing complete control of the harvesting process.

Aggregatord - Metadata harvester

Java metadata harvester used by the Folksemantic tools developed at the Center for Open Sustainable Learning at Utah State University. Utilizes the ROME library to support harvesting the various flavors of RSS. It includes extensions for harvesting metadata from the NSDL OAI server. It also supports harvesting microformats from web pages.

Ceemtu - File indexer and metadata harvester

File indexer and metadata harvester with a special focus on handling images, music, and movies. Uses Django as a centralized web front end and wxPython for GUI clients.

Spiketrap - Trap and poison email harvesters

SpikeTrap traps email address harvesters and spiders that do not obey the Robots Exclusion Standard. It automatically generates slow-loading web pages full of nonexistent email addresses, and links back to itself repeatedly with different URLs. When email harvesters encounter these pages, they will pollute their indexes with potentially thousands of bogus emails, follow the links to pages that provide even more fake addresses, and waste their time loading all the intentionally slowed pages.
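The trap described above can be illustrated with a minimal Python sketch. This is not SpikeTrap's actual code: the function names, the `/trap/` URL scheme, and the `example.invalid` domain are all assumptions made for illustration, and the deliberate slow loading is left out (a real server would delay each response).

```python
import random
import string


def fake_email(rng):
    """Return a random address under a reserved, non-routable domain (assumed)."""
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{user}@example.invalid"


def trap_page(seed, n_emails=5, n_links=3):
    """Build one HTML page of bogus addresses plus links back into the trap."""
    rng = random.Random(seed)  # the same seed always yields the same page
    emails = [fake_email(rng) for _ in range(n_emails)]
    # Hypothetical URL scheme: each link leads to another generated trap page.
    links = [f"/trap/{rng.randrange(10**9)}" for _ in range(n_links)]
    items = [f'<a href="mailto:{e}">{e}</a>' for e in emails]
    items += [f'<a href="{u}">more contacts</a>' for u in links]
    return "<html><body>" + "<br>".join(items) + "</body></html>"
```

Seeding the generator from the requested URL would keep each trap page stable across visits while still offering a practically endless set of distinct pages.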

Harvestman-crawler - A modular, flexible, extensible, multi-threaded web crawler framework/application

HarvestMan is a modular, extensible and flexible web crawler program cum framework written in pure Python. HarvestMan can be used to download files from websites according to a number of customized rules and constraints. It can also be used to find information on websites by matching keywords or regular expressions. The final goal of the project is to develop a full-fledged semantic personal data mining platform which can be used to retrieve information from the Internet in a highly customizable manner.

Southcomb - Ruby on Rails suite for metadata aggregation in a specific knowledge domain

Southcomb is a suite of tools for creating a website that catalogs metadata harvested and classified from web pages, OAI servers, and RSS feeds. It includes the following tools: SouthCombRails

Oaisearch - Open Archives Initiative Search

The OAI Search is a means of harvesting Open Archives Initiative compliant repositories and indexing them using Lucene. The OAIS web interface is designed to be simple yet have many useful search features. Written in Python.

Email Protector

Avoid SPAM! Email Protector protects the email addresses on your website. It's an easy-to-use ASP.NET web control that hides email addresses from website email harvesters. The email address is encrypted using the XTEA algorithm. (http://en.wikipedia.org/wiki/XTEA)
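For readers unfamiliar with XTEA, here is a minimal Python sketch of the block cipher itself (64-bit block, 128-bit key, 32 cycles). It is not Email Protector's code, which is an ASP.NET control; the big-endian byte order and the function names are assumptions, and how the control pads, encodes, or decrypts addresses is not described in the listing.

```python
import struct

DELTA = 0x9E3779B9   # XTEA key-schedule constant
MASK = 0xFFFFFFFF    # keep arithmetic within 32 bits


def xtea_encrypt_block(key, block, rounds=32):
    """Encrypt one 8-byte block with a 16-byte key (big-endian words assumed)."""
    v0, v1 = struct.unpack(">2I", block)
    k = struct.unpack(">4I", key)
    total = 0
    for _ in range(rounds):
        v0 = (v0 + ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + k[total & 3]))) & MASK
        total = (total + DELTA) & MASK
        v1 = (v1 + ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + k[(total >> 11) & 3]))) & MASK
    return struct.pack(">2I", v0, v1)


def xtea_decrypt_block(key, block, rounds=32):
    """Invert xtea_encrypt_block by running the rounds in reverse."""
    v0, v1 = struct.unpack(">2I", block)
    k = struct.unpack(">4I", key)
    total = (DELTA * rounds) & MASK
    for _ in range(rounds):
        v1 = (v1 - ((((v0 << 4) ^ (v0 >> 5)) + v0) ^ (total + k[(total >> 11) & 3]))) & MASK
        total = (total - DELTA) & MASK
        v0 = (v0 - ((((v1 << 4) ^ (v1 >> 5)) + v1) ^ (total + k[total & 3]))) & MASK
    return struct.pack(">2I", v0, v1)
```

An address longer than eight bytes would need padding and a block mode on top of this; those choices are outside what the listing states.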

OAI-PMH Harvester Manager

OAI-PMH Harvester Manager is a web application that manages both one-time and regularly repeating harvesting jobs using the Open Archives Initiative Protocol for Metadata Harvesting.
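As a rough illustration of what a harvesting job sends, the Python sketch below builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix`, `from`, and `resumptionToken` arguments come from the OAI-PMH protocol; the function name and the repository URL in the usage note are hypothetical, and nothing here reflects this project's own implementation.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # A resumption token is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)
```

A repeating job could call `list_records_url("http://repo.example.org/oai", from_date=last_run)` with the date of its previous run, then follow any resumption tokens returned until the list is complete.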

Harvestcms - Harvest content management system

A web content management system that is object oriented. Yes, this is yet another content management system, built mostly as a hobby, but someday it might be great.