WebHarvest - web data extraction tool


Web data extraction (web data mining, web scraping) tool. It leverages well-proven XML and text processing technologies to easily extract useful data from arbitrary web pages.

http://web-harvest.sourceforge.net



Related Projects

Web Curator Tool


The Web Curator Tool is a tool for managing the selective web harvesting process. It is designed for use in libraries by non-technical users, while allowing complete control of the harvesting process.

Aggregatord - Metadata harvester


Java metadata harvester used by the Folksemantic tools developed at the Center for Open Sustainable Learning at Utah State University. Utilizes the ROME library to support harvesting the various flavors of RSS. It includes extensions for harvesting metadata from the NSDL OAI server. It also supports harvesting microformats from web pages.

Ceemtu - File indexer and metadata harvester


File indexer and metadata harvester with a special focus on handling images, music and movies. Uses Django as a centralized web frontend and wxPython for GUI clients.

Spiketrap - Trap and poison email harvesters


SpikeTrap traps email address harvesters and spiders that do not obey the Robots Exclusion Standard. It automatically generates slow-loading web pages full of nonexistent email addresses, and links back to itself repeatedly with different URLs. When email harvesters encounter these pages, they will pollute their indexes with potentially thousands of bogus emails, follow the links to pages that provide even more fake addresses, and waste their time loading all the intentionally slowed pages.

Harvestman-crawler - A modular, flexible, extensible, multi-threaded web crawler framework/applicati


HarvestMan is a modular, extensible and flexible web crawler program cum framework written in pure Python. HarvestMan can be used to download files from websites according to a number of customized rules and constraints. It can be used to find information from websites matching keywords or regular expressions. The final goal of the project is to develop a full-fledged semantic personal data mining platform which can be used to retrieve information from the Internet in a highly customizable manne

Southcomb - Ruby on Rails suite of tools for metadata aggregation in a specific knowledge domain


Southcomb is a suite of tools for creating a website that is a catalog of metadata harvested and classified from web pages, OAI servers and RSS feeds. It includes the following tools: SouthCombRails

Oaisearch - Open Archives Initiative Search


OAI Search is a means of harvesting Open Archives Initiative compliant repositories and indexing them using Lucene. The OAIS web interface is designed to be simple yet have many useful search features. Written in Python.

Email Protector


Avoid spam! Email Protector protects the email addresses on your website. It's an easy-to-use ASP.NET web control that hides email addresses from website email harvesters. The email address is encrypted using the XTEA algorithm (http://en.wikipedia.org/wiki/XTEA).

OAI-PMH Harvester Manager


OAI-PMH Harvester Manager is a web application that manages both one-time and regularly repeating harvesting jobs using the Open Archives Initiative Protocol for Metadata Harvesting.

Harvestcms - Harvest content management system


A web content management system that is object oriented. Yes, this is yet another content management system, built mostly as a hobby, but someday it might be great.