proxy_pool - Proxy IP pool for Python web crawlers


A proxy IP pool for Python web crawlers.

https://github.com/jhao104/proxy_pool
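
A minimal usage sketch, assuming proxy_pool's documented HTTP API on its default port 5010 (verify the endpoints and port against the README for your deployment):

    import requests

    # Borrow one proxy from the pool; /get returns JSON like {"proxy": "ip:port"}.
    proxy = requests.get("http://127.0.0.1:5010/get/").json().get("proxy")

    # Route a crawl request through the borrowed proxy.
    resp = requests.get(
        "https://example.com",
        proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
    )
    print(resp.status_code)

    # If the proxy turns out to be dead, ask the pool to delete it.
    requests.get("http://127.0.0.1:5010/delete/", params={"proxy": proxy})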

Related Projects

scylla - Intelligent proxy pool for Humans™

  •    Python

For those who prefer Chinese, please read the Chinese Documentation. An example setup runs the service locally (localhost) on port 8899.
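
A quick consumer of that local service, assuming scylla's JSON API exposes validated proxies at /api/v1/proxies on port 8899 (check the project docs for the exact path and response shape):

    import requests

    # Ask the locally running scylla service for its validated proxies.
    data = requests.get("http://localhost:8899/api/v1/proxies").json()
    for entry in data.get("proxies", []):
        print(entry.get("ip"), entry.get("port"))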

Pavuk

  •    C

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. It also provides a dashboard for all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

xmrig-proxy - Monero (XMR) Stratum protocol proxy

  •    C++

An extremely high-performance Monero (XMR) Stratum protocol proxy that can easily handle over 100K connections on a cheap $5 (1024 MB) virtual machine. It reduces the number of pool connections by up to 256 times, so 100K workers become just 391 connections on the pool side. Written in C++ on libuv, the same stack as the XMRig miner. This proxy was designed and created to handle donation traffic from XMRig; no other solution works well under a high connection/disconnection rate.
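The pool-side figure follows directly from the 256-to-1 multiplexing ratio; a quick check of the arithmetic:

    import math

    # 100,000 downstream workers, multiplexed at up to 256 per upstream
    # connection, need ceil(100000 / 256) connections on the pool side.
    print(math.ceil(100_000 / 256))  # -> 391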


gain - Web crawling framework based on asyncio.

  •    Python

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. A proxy setting can be added to the spider; a sketch of the underlying mechanism follows below.
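
gain's own proxy option is not reproduced here; as a stand-in, this sketch shows how the underlying aiohttp client accepts a per-request proxy, which is the mechanism such a setting ultimately drives (the proxy address is a placeholder):

    import asyncio
    import aiohttp

    async def fetch(url: str, proxy: str) -> str:
        # aiohttp takes an HTTP proxy per request via the `proxy` argument.
        async with aiohttp.ClientSession() as session:
            async with session.get(url, proxy=proxy) as resp:
                return await resp.text()

    # Placeholder proxy address; substitute one drawn from your pool.
    html = asyncio.run(fetch("https://example.com", "http://127.0.0.1:1080"))
    print(len(html))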

go_spider - [Crawler framework (golang)] An awesome concurrent Go crawler (spider) framework

  •    Go

A crawler for vertical communities, implemented in Go. Latest stable release: version 1.2 (Sep 23, 2014).

SQL Relay - Database Connection Pool library with API available in all programming languages

  •    C++

SQL Relay is a persistent database connection pooling, proxying and load balancing system for Unix and Linux that supports ODBC and all major databases. It has APIs for C, C++, ODBC, Perl, Perl-DBI, Python, Python-DB, Zope, PHP, Ruby, Ruby-DBI, Java, TCL and Erlang, plus drop-in replacement libraries for MySQL and PostgreSQL clients.

huntsman - Super configurable async web spider

  •    Javascript

Huntsman takes one or more 'seed' URLs via the spider.queue.add() method. Once the process is kicked off with spider.start(), it takes care of extracting links from each page and following only the pages we want.

Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
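
A minimal spider illustrating the crawl-and-extract workflow (the target site, selectors and file name are placeholders):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # Emit one structured item per quote block on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination links and parse the next pages the same way.
            yield from response.follow_all(css="li.next a", callback=self.parse)

Saved as quotes_spider.py, it can be run with scrapy runspider quotes_spider.py -o quotes.json to write the extracted items to a JSON file.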

Photon - Incredibly fast crawler designed for recon.

  •    Python

The extracted information is saved in an organized manner or can be exported as JSON. You can control timeouts and delays, add seeds, exclude URLs matching a regex pattern, and more: the extensive range of options provided by Photon lets you crawl the web exactly the way you want.

coin-hive-stratum - use CoinHive's JavaScript miner on any stratum pool

  •    TypeScript

This proxy allows you to use CoinHive's JavaScript miner on a custom stratum pool. You can mine the cryptocurrencies Monero (XMR) and Electroneum (ETN).

rendora - dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites

  •    Go

Rendora can be seen as a reverse HTTP proxy server sitting between your backend server (e.g. Node.js/Express.js, Python/Django) and your frontend proxy server (e.g. nginx, traefik, apache), or even directly facing the outside world. It does nothing but transport requests and responses as they are, except when it detects whitelisted requests according to its config: in that case, Rendora instructs a headless Chrome instance to request and render the corresponding page, then returns the server-side rendered page to the client (i.e. the frontend proxy server or the outside world). This simple mechanism makes Rendora a powerful dynamic renderer without changing anything in either frontend or backend code. Dynamic rendering means the server provides server-side rendered HTML to web crawlers such as GoogleBot and BingBot, while serving the typical initial HTML to normal users for client-side rendering; it is meant to improve SEO for websites written in modern JavaScript frameworks like React, Vue and Angular.
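
A conceptual sketch of that routing decision (this is not Rendora's code; the whitelist and both helpers are hypothetical stand-ins for its config matching, headless-Chrome rendering and pass-through proxying):

    CRAWLER_AGENTS = ("googlebot", "bingbot")  # hypothetical whitelist

    def render_with_headless_chrome(path: str) -> str:
        # Stand-in for instructing headless Chrome to render the page.
        return f"<html><!-- server-side rendered {path} --></html>"

    def forward_to_backend(path: str) -> str:
        # Stand-in for transporting the request to the backend unchanged.
        return f"<html><!-- initial client-side HTML for {path} --></html>"

    def handle(user_agent: str, path: str) -> str:
        # Whitelisted crawlers get pre-rendered HTML; everyone else gets
        # the normal response, passed through as-is.
        if any(bot in user_agent.lower() for bot in CRAWLER_AGENTS):
            return render_with_headless_chrome(path)
        return forward_to_backend(path)

    print(handle("Mozilla/5.0 (compatible; Googlebot/2.1)", "/home"))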

crawler - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

  •    PHP

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites; under the hood, Chrome and Puppeteer are used to power this feature.

node-readability - Scrape/Crawl article from any site automatically

  •    Javascript

In the author's case, the spider handles about 1,500k documents per day; the maximum crawling speed is about 1.2k per minute (1k per minute on average), memory cost is about 200 MB per spider kernel, and accuracy is about 90%. The remaining 10% can be fixed by customizing Score Rules or Selectors. It performs better than any other readability module.

yacy_grid_crawler - Crawler Microservice for the YaCy Grid

  •    Java

The Crawler is a microservice which can be deployed, e.g., using Docker. When the Crawler component is started, it searches for an MCP and connects to it. By default the local host is searched for an MCP, but you can configure a different one yourself. Every loader and parser microservice must read the crawl profile information. Because that information is required many times, a lookup in the crawler index is avoided by adding the crawl profile to each contract of a crawl job in the crawler_pending and loader_pending queues.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make the lives of Enterprise Search integrators and developers easier. It is portable and extensible, supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and much more.

CinaProxy - proxy server written in VB6 for Win32

  •    VB

A Win32 proxy server programmed in easy Visual Basic, featuring a WinSock pool, SOCKS4/5, SMTP relay, gateways, a dial-up manager, an HTTP(S) proxy, a web server (PHP, XML, directory listing), remote shutdown, port mapping, FTP, CryptoAPI sockets, a web ad-blocker, an NT service and telnet.

fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

  •    Go

Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays. The package has a single external dependency, robotstxt. It also integrates code from the iq package.