Scrapy - Web crawling & scraping framework for Python

Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

https://scrapy.org/
https://github.com/scrapy/scrapy
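To make the "extract structured data" part concrete, here is a minimal spider sketch against quotes.toscrape.com, Scrapy's public demo site (the selectors below assume that site's markup):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one structured item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination links until the site runs out of pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, this runs standalone with "scrapy runspider quotes_spider.py -o quotes.json", no full project required.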


Related Projects

scrapy-redis - Redis-based components for Scrapy.


Redis-based components for Scrapy. You can start multiple spider instances that share a single Redis queue; see the configuration sketch below. Best suited for broad multi-domain crawls.
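As a rough sketch of how that shared queue is wired in, the scrapy-redis README documents settings along these lines (the Redis URL is a placeholder for your own server):

# settings.py -- a minimal sketch, assuming a Redis server at localhost:6379
# Schedule all requests through Redis so every spider instance
# pops work from the same shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests across instances via a shared Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue between runs instead of clearing it on exit.
SCHEDULER_PERSIST = True
REDIS_URL = "redis://localhost:6379"

With this in place, ordinary Scrapy spiders keep working unchanged; the request queue and duplicate filter simply move into Redis.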

Norconex HTTP Collector - Enterprise Web Crawler


Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate collected data and store it in a repository of your choice (e.g., a search engine). It is very flexible, powerful, easy to extend, and portable.


scrapy-examples - Multifarious Scrapy examples


Multifarious Scrapy examples with integrated proxies and user agents that make it comfortable to write a spider. The spiders crawl several levels deep and collect real data starting from depth 2; a sketch of the proxy and user-agent wiring appears below.
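For flavor, a hedged sketch of the per-spider proxy and user-agent wiring this kind of example demonstrates (standard Scrapy facilities; the spider name, proxy address, and bot string are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Override the user agent for this spider only.
    custom_settings = {"USER_AGENT": "Mozilla/5.0 (compatible; example-bot)"}

    def start_requests(self):
        # Scrapy's built-in HttpProxyMiddleware honors meta["proxy"],
        # routing this request through the given proxy server.
        yield scrapy.Request(
            "https://example.com/",
            meta={"proxy": "http://127.0.0.1:8080"},
        )

    def parse(self, response):
        yield {"title": response.css("title::text").get()}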

colly - Elegant Scraper and Crawler Framework for Golang


Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Arachnode.net


An open source .NET web crawler written in C# using SQL Server 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing, and storing Internet content, including e-mail addresses, files, hyperlinks, images, and Web pages.

Norconex HTTP Collector - A Web Crawler in Java


Norconex HTTP Collector is a web spider, or crawler, that aims to make Enterprise Search integrators' and developers' lives easier. It is portable, extensible, and reusable, supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and more.

Open Search Server


Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. It is built using leading open source technologies such as Lucene, zkoss, Tomcat, POI, and TagSoup. Open Search Server is a stable, high-performance piece of software.

node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery ;-)


Web Crawler/Spider for NodeJS + server-side jQuery ;-)

ants-go - open source, distributed, restful crawler engine in golang


I wrote a crawler engine named ants in Python, based on Scrapy. But a dynamic language can get chaotic at times, so I started rewriting it in a compiled language.

crawler - A high performance web crawler in Elixir.


A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

Aperture - Java framework for getting data and metadata


Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, and Zip, and extracts metadata from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.

Squzer - Distributed Web Crawler


Squzer is Declum's open-source, extensible, scalable, multithreaded web crawler project, written entirely in Python.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns


grab-site is an easy, preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling, and it provides a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

go_spider - [Crawler framework (golang)] An awesome Go concurrent crawler (spider) framework


A crawler for vertical communities, written in Go. Latest stable release: version 1.2 (Sep 23, 2014).