gain - Web crawling framework based on asyncio.

  •    Python

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Proxy settings can be added to a spider, as sketched below.

https://github.com/gaojiuli/gain
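A minimal sketch of what a proxy-enabled gain spider might look like, assuming the Css/Item/Parser/Spider API and the spider-level proxy attribute shown in the project's README; the selectors, URLs and proxy address are illustrative placeholders:

    from gain import Css, Item, Parser, Spider

    class Post(Item):
        # CSS selector is a placeholder for the target site's markup
        title = Css('.entry-title')

        async def save(self):
            print(self.results['title'])

    class MySpider(Spider):
        concurrency = 5
        start_url = 'https://blog.scrapinghub.com/'
        # Spider-level proxy setting, per the README's proxy example
        proxy = 'https://localhost:1234'
        parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
                   Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]

    MySpider.run()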


Related Projects

ruia - Async Python 3.6+ web scraping micro-framework based on asyncio.

  •    Python

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.
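A minimal sketch of a Ruia item declaration, assuming the Item/TextField/AttrField API from the project's documentation; the target site and CSS selectors are illustrative:

    import asyncio

    from ruia import AttrField, Item, TextField

    class HackerNewsItem(Item):
        # target_item makes the parser yield one Item per matching node
        target_item = TextField(css_select='tr.athing')
        title = TextField(css_select='a.storylink')
        url = AttrField(css_select='a.storylink', attr='href')

    async def main():
        async for item in HackerNewsItem.get_items(url='https://news.ycombinator.com/'):
            print(item.title, item.url)

    asyncio.run(main())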

aiohttp - Async http client/server framework (asyncio)

  •    Python

For this release we completely refactored the low-level implementation of HTTP handling, and uvloop finally gives a performance improvement: overall performance should be around 70-90% better than the 1.x version. We also took the opportunity to fix long-standing API design problems across the whole package. Client exception handling has been cleaned up and is now much more straightforward, client payload management has been simplified and can be extended with any custom type, and the client connection pool has been redesigned: there is no longer any need to actively release response objects, as aiohttp handles connection release automatically.
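A minimal client sketch illustrating the automatic connection release described above; the URL is a placeholder:

    import asyncio

    import aiohttp

    async def fetch(url):
        async with aiohttp.ClientSession() as session:
            # Leaving the inner 'async with' block returns the connection
            # to the pool; no explicit response.release() call is needed.
            async with session.get(url) as response:
                return await response.text()

    print(asyncio.run(fetch('https://example.com')))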

uvloop - Ultra fast asyncio event loop.

  •    Python

uvloop is a fast, drop-in replacement for the built-in asyncio event loop. uvloop is implemented in Cython and uses libuv under the hood. The project documentation and wiki are available in the GitHub repository.
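Switching an asyncio program over to uvloop is a one-line change; a minimal sketch using the documented event loop policy:

    import asyncio

    import uvloop

    # Install uvloop as the policy; loops created from here on are uvloop loops.
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

    async def main():
        loop = asyncio.get_running_loop()
        print(type(loop))  # expected to be a uvloop loop, not the default one

    asyncio.run(main())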

gino - GINO Is Not ORM - a Python asyncio ORM on SQLAlchemy core.

  •    Python

GINO - GINO Is Not ORM - is a lightweight asynchronous ORM built on top of SQLAlchemy Core for Python asyncio. At the time of writing (early 2018) GINO supports only one dialect, asyncpg. There are a few tasks in the GitHub issues marked as help wanted; please feel free to take any of them, and pull requests are greatly welcome.
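A minimal sketch of a GINO model and query, assuming the Gino()/db.Model/db.set_bind API from the project's README; the connection string, table and column names are placeholders:

    import asyncio

    from gino import Gino

    db = Gino()

    class User(db.Model):
        __tablename__ = 'users'

        id = db.Column(db.Integer(), primary_key=True)
        nickname = db.Column(db.Unicode())

    async def main():
        # asyncpg is the only supported dialect at the time of writing
        await db.set_bind('postgresql://localhost/gino')
        await db.gino.create_all()

        user = await User.create(nickname='fantix')
        print(user.id, user.nickname)

        await db.pop_bind().close()

    asyncio.run(main())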


Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make the life of Enterprise Search integrators and developers easier. It is portable, extensible and reusable, supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and much more.

node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery ;-)

  •    Javascript


Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

go_spider - [Crawler framework (Golang)] An awesome concurrent Go crawler (spider) framework

  •    Go

A crawler for vertical communities, written in Go. Latest stable release: Version 1.2 (Sep 23, 2014).

colly - Elegant Scraper and Crawler Framework for Golang

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

uvicorn - The lightning-fast ASGI server. 🦄

  •    Python

Uvicorn is a lightning-fast ASGI server implementation, using uvloop and httptools.
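A minimal ASGI application sketch; saved as app.py, it could be served with uvicorn app:app (the module:attribute form from the uvicorn docs). The file name and response body are illustrative:

    # app.py - a bare ASGI application, no framework required
    async def app(scope, receive, send):
        assert scope['type'] == 'http'
        await send({
            'type': 'http.response.start',
            'status': 200,
            'headers': [(b'content-type', b'text/plain')],
        })
        await send({
            'type': 'http.response.body',
            'body': b'Hello, world!',
        })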

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ. The project's README includes a high-level architecture diagram demonstrating how Crawler works.

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It is currently able to detect thousands of bots/spiders/crawlers. To install, run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect": "1.*" to your composer.json.

Monkey-Spider

  •    Python

Monkey-Spider is a crawler-based, low-interaction honeyclient project. It is not restricted to this use, but it was developed as such. Monkey-Spider crawls web sites to expose their threats to web clients.

dhtspider - Bittorrent dht network spider

  •    Javascript

BitTorrent DHT network infohash spider, built for engiy.com (a BitTorrent resource search engine).

Squzer - Distributed Web Crawler

  •    Python

Squzer is Declum's open-source, extensible, scalable, multithreaded web crawler project, written entirely in Python.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy, preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. It provides a dashboard for all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.




