Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate collected data and store it in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Norconex HTTP Collector - Enterprise Web Crawler

Nutch is open source web-search software. It builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. Its main features include:
  <UL>
	<LI>Parallel and distributed fetching, parsing, and indexing</LI>
	<LI>Plugin support</LI>
	<LI>Ontology</LI>
	<LI>Clustering</LI>
	<LI>Distributed filesystem (via Hadoop)</LI>
	<LI>Link-graph database</LI>
	<LI>NTLM authentication</LI>
	<LI>MapReduce</LI>
	<LI>Many formats: plain text, HTML, XML, ZIP, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, JavaScript, RSS, RTF, MP3 (ID3 tags)</LI>
 </UL>

Nutch - Highly extensible, highly scalable Web crawler

An open source .NET web crawler written in C# using SQL Server 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing, and storing Internet content, including e-mail addresses, files, hyperlinks, images, and Web pages. Its features include:
  <UL>
	<LI>.NET architecture</LI>
	<LI>Configurable Rules and Actions</LI>
	<LI>Lucene.NET Integration</LI>
	<LI>SQL Server 2008 and full-text indexing</LI>
	<LI>.DOC/.PDF/.PPT/.XLS Indexing</LI>
	<LI>HTML to XML and XHTML</LI>
	<LI>Multi-threading and Throttling</LI>
	<LI>Respectful Crawling</LI>
	<LI>Analysis Services</LI>
	<LI>SQL Server 2008 and SSIS</LI>
	<LI>EXIF data extraction</LI>
 </UL>

Arachnode.net

Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. Built on open source technologies such as Lucene, Zkoss, Tomcat, POI, and TagSoup, Open Search Server is a stable, high-performance piece of software. Its features include:
   <UL>
	<LI>Multi-language indexing</LI>
	<LI>The crawlers go through web sites and file systems to rapidly and easily build your index.</LI>
	<LI>Numerous document formats are supported, such as XML, HTML/XHTML, Adobe™ PDF, Microsoft™ Word™, PowerPoint™, OpenOffice™, etc.</LI>
	<LI>Quick integration thanks to an XML interface via HTTP queries (XML over HTTP) and PHP classes</LI>
	<LI>The web interface is built on the Zkoss framework and runs in the main Ajax-capable browsers. This RIA-type interface is as comfortable to use as a heavy client</LI>
   </UL>

Open Search Server

Grub Next Generation is a distributed web crawling system (clients/servers) that helps build and maintain an index of the Web. It uses a client-server architecture in which the client crawls the web and updates the server. The peer-to-peer grubclient software crawls during computer idle time.

Grub

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and to collect material at a measured, adaptive pace unlikely to disrupt normal website activity. <BR><BR>It provides a web interface for operator control and monitoring of crawls. It stores content in ARC or ISO WARC aggregate/transcript format.
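
The robots.txt behaviour described above can be illustrated with a minimal sketch. This is not Heritrix code (Heritrix is written in Java); it is a Python illustration using the standard library's `urllib.robotparser`. The robots.txt content, the user agent `example-bot`, the URLs, and the helper names `build_parser` and `plan_fetches` are all hypothetical, chosen only for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(parser, user_agent, urls, default_delay=1.0):
    """Return (url, delay_in_seconds) pairs for the URLs the rules allow.

    The delay comes from the Crawl-delay directive when one is present,
    otherwise from default_delay (an assumed fallback value).
    """
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay
    return [(url, float(delay)) for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    candidates = [
        "https://example.com/index.html",
        "https://example.com/private/a.html",
    ]
    for url, delay in plan_fetches(parser, "example-bot", candidates):
        print(f"fetch {url}, then wait {delay} s")
```

A real crawler would also honour META robots tags and adapt its pace to server response times, which this sketch does not attempt.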

Heritrix

ASPseek is Internet search engine software developed by SWsoft. ASPseek consists of an indexing robot, a search daemon, and a CGI search frontend. It can index as many as a few million URLs and search for words and phrases, use wildcards, and do a Boolean search. Search results can be limited to a given time period, site, or Web space (set of sites) and sorted by relevance (PageRank is used) or date. <BR><BR>ASPseek is optimized for multiple sites (threaded index, async DNS lookups, grouping results by site, Web spaces), but can be used for searching one site as well. ASPseek can work with multiple languages/encodings at once (including multibyte encodings such as Chinese) thanks to its Unicode storage mode. Other features include stopwords and ispell support, a charset and language guesser, HTML templates for search results, excerpts, and query-word highlighting.

ASPseek

mnoGoSearch for UNIX consists of a command-line indexer and a search program that can be run under Apache Web Server or any other HTTP server supporting the CGI interface. mnoGoSearch for UNIX is distributed as source code and can be compiled with a number of databases, depending on the user's choice. It is known to work on a wide variety of modern Unix operating systems, including Linux, FreeBSD, Mac OS X, Solaris, and others.

mnoGoSearch
