Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy - Web crawling & scraping framework for Python

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports many file formats, including Office documents, PDF, and Zip archives, and it also extracts metadata from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

Aperture - Java framework for getting data and metadata

Gigablast is one of the remaining four search engines in the United States that maintains its own searchable index of over a billion pages. It is designed to scale to thousands of servers and has scaled to over 12 billion web pages on over 200 servers. Its features include a distributed web crawler, document conversion, automated data corruption detection and repair, clustering of results from the same site, synonym search, a spell checker, and more.

Gigablast - Web and Enterprise search engine in C++
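
The "cluster results from the same site" feature mentioned above can be illustrated with a minimal sketch. This is not Gigablast's implementation (Gigablast is written in C++); it only shows the general idea of grouping result URLs by hostname. The function name `cluster_by_site` and the sample URLs are assumptions made for illustration.

```python
from urllib.parse import urlparse


def cluster_by_site(urls):
    """Group result URLs by hostname, keeping first-seen order of sites."""
    clusters = {}
    for url in urls:
        # urlparse(...).hostname is lowercased; fall back to "" if missing.
        host = urlparse(url).hostname or ""
        clusters.setdefault(host, []).append(url)
    return clusters


# Hypothetical search results, for illustration only.
results = [
    "https://example.com/a",
    "https://example.org/x",
    "https://example.com/b",
]
clusters = cluster_by_site(results)
```

A real search engine would likely cluster on a normalized site key (for example, the registered domain) rather than the raw hostname; the hostname is used here only to keep the sketch short.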

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set. Pavuk is a multifunctional open source web grabber with slow but continuous development. Its features include:
- recursive downloading based on links inside HTML documents
- transformation of Gopher and FTP directories into HTML documents
- support for proxy servers (HTTP, FTP, SSL, HTTP gateway for FTP, HTTP gateway for Gopher, SOCKS 4/5)
- support for authentication against HTTP servers and HTTP proxy servers
- restart of a transfer after a program break, link down, timeout or some other error
- running on a terminal or inside an X Window System window
- Native Language Support based on GNU gettext
- FTP over SSL
- multiple HTTP proxies used in round-robin fashion

Pavuk
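
To make "recursive downloading based on links inside HTML documents" concrete, here is a minimal standard-library sketch of the idea. It is not Pavuk's code or command-line interface. The `crawl` function, the `fetch` callable, and the in-memory `SITE` dictionary are all hypothetical stand-ins, so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first traversal; `fetch` maps a URL to HTML text or None."""
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # document not available, skip it
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return order


# Hypothetical in-memory "website" used instead of real HTTP requests.
SITE = {
    "http://example.test/": '<a href="a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="c.html#top">C</a>',
    "http://example.test/b.html": "no links",
    "http://example.test/c.html": "leaf",
}

visited = crawl("http://example.test/", SITE.get, max_depth=2)
```

The `seen` set keeps the traversal from looping when pages link back to each other, and `max_depth` bounds how far the recursion goes from the start page.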

Norconex HTTP Collector is a web spider, or crawler, that aims to make life easier for Enterprise Search integrators and developers. It is portable, extensible, and reusable; it supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and more.

Norconex HTTP Collector - A Web Crawler in Java
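
As an illustration of what robots.txt support means for a crawler, the following sketch checks URLs against a set of rules before fetching them. It uses Python's standard `urllib.robotparser` module, not Norconex's Java API. The rules, the user-agent string `example-crawler`, and the `allowed` helper are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would normally download it
# from the site's /robots.txt before crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-crawler"):
    """Return True if the parsed rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


can_fetch_public = allowed("http://example.test/public/page.html")
can_fetch_private = allowed("http://example.test/private/page.html")
```

A polite crawler would run a check like `allowed(url)` for every candidate URL and skip the ones that return `False`.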

soup is a small web scraper package for Go, with an interface highly similar to that of BeautifulSoup.

Soup - Web Scraper in Go, similar to BeautifulSoup

StormCrawler is an open source library and collection of resources that developers can use to build their own low-latency, scalable web crawlers on Apache Storm. The good news is that doing so can be pretty straightforward. Often, all you'll have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce.

Storm Crawler - Web crawler SDK based on Apache Storm 
