Displaying 1 to 3 from 3 results

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repositoriy of your choice (e.g. a search engine). It very flexible, powerful, easy to extend, and portable.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.