Displaying 1 to 6 from 6 results

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repositoriy of your choice (e.g. a search engine). It very flexible, powerful, easy to extend, and portable.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Grub

  •    CSharp

Grub Next Generation is distributed web crawling system (clients/servers) which helps to build and maintain index of the Web. It is client-server architecture where client crawls the web and updates the server. The peer-to-peer grubclient software crawls during computer idle time.

Open Search Server

  •    C++

Open Search Server is both a modern crawler and search engine and a suite of high-powered full text search algorithms. Built using the best open source technologies like lucene, zkoss, tomcat, poi, tagsoup. Open Search Server is a stable, high-performance piece of software.




Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.

algolia-webcrawler - Simple node worker that crawls sitemaps in order to keep an algolia index up-to-date

  •    Javascript

Simple node worker that crawls sitemaps in order to keep an Algolia index up-to-date. It uses simple CSS selectors in order to find the actual text content to index.