ruia - Async Python 3.6+ web scraping micro-framework based on asyncio.

  •        5

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, aims to make crawling url as convenient as possible.

https://docs.python-ruia.org/
https://github.com/howie6879/ruia

Tags
Implementation
License
Platform

   




Related Projects

gain - Web crawling framework based on asyncio.

  •    Python

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. You can add proxy setting to spider as above.

aiohttp - Async http client/server framework (asyncio)

  •    Python

For this release we completely refactored low-level implementation of http handling. Finally uvloop gives performance improvement. Overall performance improvement should be around 70-90% compared to 1.x version.We took opportunity to refactor long standing api design problems across whole package. Client exceptions handling has been cleaned up and now much more straight forward. Client payload management simplified and allows to extend with any custom type. Client connection pool implementation has been redesigned as well, now there is no need for actively releasing response objects, aiohttp handles connection release automatically.

uvloop - Ultra fast asyncio event loop.

  •    Python

uvloop is a fast, drop-in replacement of the built-in asyncio event loop. uvloop is implemented in Cython and uses libuv under the hood. The project documentation can be found here. Please also check out the wiki.

colly - Elegant Scraper and Crawler Framework for Golang

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.


Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

grab - Web Scraping Framework

  •    Python

Project Grab is not abandoned but it is not being actively developed. At current time I am working on another crawling framework which I want to be simple, fast and does not leak memory. New project is located here: https://github.com/lorien/crawler First, I've tried to use mix of asyncio (network) and classic threads (parsing HTML with lxml on multiple CPU cores) but then I've decided to use classic threads for everything for the sake of simplicity. Network requests are processed with pycurl because it is fast, feature-rich and supports socks5 proxies. You can try new framework but be aware it does not have many features yet. In particular, its options to configure network requests are very pure. If you need some option, feel free to create new issue.

colly - Fast and Elegant Scraping Framework for Gophers

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider.With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

gino - GINO Is Not ORM - a Python asyncio ORM on SQLAlchemy core.

  •    Python

GINO - GINO Is Not ORM - is a lightweight asynchronous ORM built on top of SQLAlchemy core for Python asyncio. Now (early 2018) GINO supports only one dialect asyncpg. There are a few tasks in GitHub issues marked as help wanted. Please feel free to take any of them and pull requests are greatly welcome.

go_spider - [爬虫框架 (golang)] An awesome Go concurrent Crawler(spider) framework

  •    Go

A crawler of vertical communities achieved by GOLANG. Latest stable Release: Version 1.2 (Sep 23, 2014).

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

dht - BitTorrent DHT Protocol && DHT Spider.

  •    Go

See the video on the Youtube.It contains two modes, the standard mode and the crawling mode. The standard mode follows the BEPs, and you can use it as a standard dht server. The crawling mode aims to crawl as more metadata info as possiple. It doesn't follow the standard BEPs protocol. With the crawling mode, you can build another BTDigg.

asyncpg - A fast PostgreSQL Database Client Library for Python/asyncio.

  •    Python

asyncpg is a database interface library designed specifically for PostgreSQL and Python/asyncio. asyncpg is an efficient, clean implementation of PostgreSQL server binary protocol for use with Python's asyncio framework. You can read more about asyncpg in an introductory blog post. asyncpg requires Python 3.5 or later and is supported for PostgreSQL versions 9.2 to 10.

trio - Trio – Pythonic async I/O for humans and snake people 🐍

  •    Python

The Trio project's goal is to produce a production-quality, permissively licensed, async/await-native I/O library for Python. Like all async libraries, its main purpose is to help you write programs that do multiple things at the same time with parallelized I/O. A web spider that wants to fetch lots of pages in parallel, a web server that needs to juggle lots of downloads and websocket connections at the same time, a process supervisor monitoring multiple subprocesses... that sort of thing. Compared to other libraries, Trio attempts to distinguish itself with an obsessive focus on usability and correctness. Concurrency is complicated; we try to make it easy to get things right. Trio was built from the ground up to take advantage of the latest Python features, and draws inspiration from many sources, in particular Dave Beazley's Curio. The resulting design is radically simpler than older competitors like asyncio and Twisted, yet just as capable. Trio is the Python I/O library I always wanted; I find it makes building I/O-oriented programs easier, less error-prone, and just plain more fun. Perhaps you'll find the same.

Kyoukai - [OLD] A fully async web framework for Python3.5+ using asyncio

  •    Python

This project will not be getting any more updates. Kyōkai is a fast asynchronous Python server-side web framework. It is built upon asyncio and the Asphalt framework for an extremely fast web server.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler that aims to make Enterprise Search integrators and developers's life easier. It is Portable, Extensible, reusable, Robots.txt support, Obtain and manipulate document metadata, Resumable upon failure and lot more.

node-readability - Scrape/Crawl article from any site automatically

  •    Javascript

In my case, the speed of spider is about 1500k documents per day, and the maximize crawling speed is 1.2k /minute, avg 1k /minute, the memory cost are about 200 MB on each spider kernel, and the accuracy is about 90%, the rest 10% can be fixed by customizing Score Rules or Selectors. it's better than any other readability modules.

awesome-asyncio - A curated list of awesome Python asyncio frameworks, libraries, software and resources

  •    

A carefully curated list of awesome Python asyncio frameworks, libraries, software and resources. The Python asyncio module introduced to the standard library with Python 3.4 provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives.