
webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework that covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.
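
As a rough sketch of that lifecycle, assuming webmagic's usual PageProcessor/Spider API (class names per the us.codecraft.webmagic packages), a minimal crawler might look like this:

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;

    // Minimal sketch: the PageProcessor handles content extraction and URL
    // management, while the Spider drives downloading and result persistence.
    public class ExampleProcessor implements PageProcessor {

        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Content extraction: grab the page title with an XPath expression.
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
            // URL management: queue further links discovered on the page.
            page.addTargetRequests(page.getHtml().links().all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            // https://example.com is a placeholder start URL.
            Spider.create(new ExampleProcessor())
                  .addUrl("https://example.com")
                  .thread(2)
                  .run();
        }
    }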

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specific features such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more, and metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.




fscrawler - Elasticsearch File System Crawler (FS Crawler)

  •    Java

FS Crawler offers a simple way to index binary files into Elasticsearch.

YaCy - Decentralized Web Search

  •    Java

YaCy (read "ya see") is a free distributed search engine, built on principles of peer-to-peer (P2P) networks. It runs on several hundred computers, so-called YaCy peers. Each YaCy peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database that is shared with other YaCy peers using P2P principles.

Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make Enterprise Search integrators' and developers' lives easier. It is portable, extensible, and reusable; supports robots.txt; can obtain and manipulate document metadata; is resumable upon failure; and much more.


Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is a library that developers can leverage to build their own crawlers, and doing so can be pretty straightforward: often, all you have to do is declare StormCrawler as a Maven dependency, write your own topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce, as sketched below.
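
A minimal sketch of such a topology class, assuming StormCrawler's ConfigurableTopology base class and its bundled MemorySpout, FetcherBolt, JSoupParserBolt, and StdOutIndexer components (package names vary between the older com.digitalpebble.stormcrawler and newer Apache releases, so treat them as assumptions):

    import org.apache.storm.topology.TopologyBuilder;

    import com.digitalpebble.stormcrawler.ConfigurableTopology;
    import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
    import com.digitalpebble.stormcrawler.spout.MemorySpout;

    // Sketch of a custom topology extending ConfigurableTopology, wiring a few
    // of the components shipped with the project. A production topology would
    // also add URL partitioning, URL filtering, and a status updater.
    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // Seed URLs kept in memory; https://example.com/ is a placeholder.
            builder.setSpout("spout", new MemorySpout("https://example.com/"));

            builder.setBolt("fetch", new FetcherBolt())
                   .shuffleGrouping("spout");

            builder.setBolt("parse", new JSoupParserBolt())
                   .localOrShuffleGrouping("fetch");

            builder.setBolt("index", new StdOutIndexer())
                   .localOrShuffleGrouping("parse");

            // "conf" is inherited from ConfigurableTopology.
            return submit("crawl", conf, builder);
        }
    }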

NewPipeExtractor - Core part of NewPipe

  •    Java

NewPipe Extractor is a library for extracting things from streaming sites. It is a core component of NewPipe, but can be used independently. NewPipe Extractor is available from JitPack's Maven repository.

fess-crawler - Web/FileSystem Crawler Library

  •    Java

Fess Crawler is a crawler framework for web sites and file systems.

prerender-java - Java framework for Prerender

  •    Java

Use this Java filter to prerender a JavaScript-rendered page using an external service and return the resulting HTML to search engine crawlers for SEO. Note: make sure you have more than one web server thread/process running, because the prerender service will make a request back to your server to render the HTML.
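
A hedged sketch of registering the filter in a Spring Boot application (the PreRenderSEOFilter class name and the prerenderToken parameter reflect the project's typical setup and should be verified against its README; plain web.xml registration is also possible):

    import org.springframework.boot.web.servlet.FilterRegistrationBean;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    import com.github.greengerong.PreRenderSEOFilter;

    // Registers the prerender servlet filter for all requests so that crawler
    // requests are served prerendered HTML fetched from the external service.
    @Configuration
    public class PrerenderConfig {

        @Bean
        public FilterRegistrationBean<PreRenderSEOFilter> prerenderFilter() {
            FilterRegistrationBean<PreRenderSEOFilter> registration =
                    new FilterRegistrationBean<>(new PreRenderSEOFilter());
            // "YOUR_TOKEN" is a placeholder for the prerender service token.
            registration.addInitParameter("prerenderToken", "YOUR_TOKEN");
            registration.addUrlPatterns("/*");
            return registration;
        }
    }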

robots

  •    Java

A distributed robots.txt parser and rule checker accessed through an API. If you are working on a distributed web crawler and want to be polite in your actions, you will find this project very useful. It can also be integrated into any SEO tool to check whether content is correctly indexed by robots.




