gocrawl - Polite, slim and concurrent web crawler.

  •    Go

gocrawl is a polite, slim and concurrent web crawler written in Go. For a simpler yet more flexible web crawler written in a more idiomatic Go style, you may want to take a look at fetchbot, a package that builds on the experience of gocrawl.

arachni - Web Application Security Scanner Framework

  •    Ruby

Arachni is a full-featured, modular, high-performance Ruby framework aimed at helping penetration testers and administrators evaluate the security of web applications. It is smart: it trains itself by monitoring and learning from the web application's behavior during the scan process, and it can perform meta-analysis using a number of factors to correctly assess the trustworthiness of results and intelligently identify (or avoid) false positives.

scrapy-redis - Redis-based components for Scrapy.

  •    Python

Redis-based components for Scrapy. You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.
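The shared-queue setup is configured through the Scrapy project's settings; a minimal sketch using scrapy-redis's documented setting names (the Redis URL is an assumption for illustration):

```python
# settings.py fragment: point Scrapy's scheduler and dupe filter at
# scrapy-redis so that all spider instances share one Redis queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs instead of clearing it on close.
SCHEDULER_PERSIST = True

# Assumed local Redis instance; adjust for your deployment.
REDIS_URL = "redis://localhost:6379"
```

Every spider instance started with these settings pops requests from the same Redis queue, which is what makes it straightforward to scale a broad crawl across processes or machines.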

Photon - Incredibly fast crawler designed for recon.

  •    Python

The extracted information is saved in an organized manner, or it can be exported as JSON. You can control timeouts and delays, add seeds, exclude URLs matching a regex pattern, and more. The extensive range of options provided by Photon lets you crawl the web exactly the way you want.

jd-autobuy - A Python crawler that logs in to JD.com automatically and snaps up products online

  •    Python

The code is for learning purposes only; JD.com's pages change constantly, so the code is not guaranteed to always run correctly. If you find a bug, pull requests are welcome.

fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

  •    Go

Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays. The package has a single external dependency, robotstxt. It also integrates code from the iq package.
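fetchbot itself is Go, but the politeness contract it implements (honor Disallow rules and the Crawl-delay directive before each fetch) can be sketched with Python's standard-library robots.txt parser:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt; a polite crawler consults this before fetching.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a URL may be fetched, and how long to wait between fetches.
allowed = rp.can_fetch("*", "https://example.com/public/page")
blocked = rp.can_fetch("*", "https://example.com/private/page")
delay = rp.crawl_delay("*")  # seconds between requests, from Crawl-delay
```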

AppCrawler - An Appium-based automatic app traversal tool

  •    Scala

An app crawler based on automatic traversal. It supports Android and iOS, on both real devices and emulators. Its biggest strength is flexibility: traversal rules can be defined through configuration.

mzitu - 👧 A crawler for glamour photo gallery sets (part 2)

  •    Python

👧 A crawler for glamour photo gallery sets (part 2).

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
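OPQ's worker pooling and rate limiting are Elixir-specific, but the general pattern (a fixed pool of workers draining a shared queue, with a global minimum interval between fetches) can be sketched in Python; all names here are illustrative, not Crawler's API:

```python
import queue
import threading
import time

def crawl_with_pool(urls, fetch, workers=3, interval=0.01):
    """Fetch every URL with a fixed pool of worker threads, enforcing a
    global minimum interval between fetches (a crude rate limit)."""
    jobs = queue.Queue()
    for u in urls:
        jobs.put(u)
    results = {}
    gate = threading.Lock()
    last_fetch = [0.0]  # shared timestamp of the most recent fetch

    def worker():
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            with gate:  # one worker at a time passes the rate limiter
                wait = last_fetch[0] + interval - time.monotonic()
                if wait > 0:
                    time.sleep(wait)
                last_fetch[0] = time.monotonic()
            results[url] = fetch(url)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```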

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy, preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. It also provides a dashboard for all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

dhtspider - BitTorrent DHT network spider

  •    Javascript

BitTorrent DHT network infohash spider, built for engiy.com (a BitTorrent resource search engine).
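A DHT spider like this harvests infohashes by decoding the bencoded KRPC messages (notably announce_peer) that peers send over UDP. The extraction step can be sketched with a minimal bencode decoder; the sample message bytes below are fabricated for the example:

```python
def bdecode(data, i=0):
    """Decode one bencoded value at offset i; return (value, next_offset)."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                      # list: l<items>e
        i, items = i + 1, []
        while data[i:i + 1] != b"e":
            v, i = bdecode(data, i)
            items.append(v)
        return items, i + 1
    if c == b"d":                      # dict: d<key><value>...e
        i, d = i + 1, {}
        while data[i:i + 1] != b"e":
            k, i = bdecode(data, i)
            v, i = bdecode(data, i)
            d[k] = v
        return d, i + 1
    colon = data.index(b":", i)        # byte string: <length>:<bytes>
    n = int(data[i:colon])
    return data[colon + 1:colon + 1 + n], colon + 1 + n

# A fabricated announce_peer query: the KRPC message a DHT spider
# listens for, carrying the infohash of a torrent a peer is sharing.
msg = (b"d1:ad2:id20:" + b"A" * 20 + b"9:info_hash20:" + b"B" * 20 +
       b"4:porti6881e5:token4:abcde1:q13:announce_peer1:t2:aa1:y1:qe")

decoded, _ = bdecode(msg)
info_hash = decoded[b"a"][b"info_hash"]  # the 20-byte torrent infohash
```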

jDistiller - A page scraping DSL for extracting structured information from unstructured XHTML, built on Node

  •    Javascript

Over my past couple of years in the industry, there have been several times when I needed to scrape structured information from (relatively) unstructured XHTML websites. A closure can optionally be provided as the third parameter of the set() method.

routers-news - A crawler for various popular tech news sources

  •    Javascript

Routers is a collection of web crawlers for various popular technology news sources. It exposes a command-line interface to these crawlers, allowing the discerning tech-news enthusiast to avoid leaving the comfort of their terminal.

wikifetch - Uses jQuery to return a structured JSON representation of a Wikipedia article.

  •    Javascript

For some NLP research I'm currently doing, I was interested in parsing structured information from Wikipedia articles.

rippled-network-crawler - Crawls all nodes in rippled network

  •    Javascript

This crawls the Ripple network by making requests to the /crawl endpoint of each peer it can connect to, starting from an entry point. Some peers may know, and publish (perhaps errantly), the IP associated with a peer, while others don't. We merge the points of view of each peer, collecting a dict of data keyed by IP address. This maps out the connections between all rippled servers (not necessarily UNLs), which for the most part don't participate in consensus, or at least have no say in influencing the outcome of a transaction on mainnet.
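The merge-of-views approach described here can be sketched as a breadth-first walk in which each peer's /crawl response (mocked below as a plain function, since the real crawler makes HTTP requests) is folded into one dict keyed by IP:

```python
from collections import deque

def crawl_network(entry_ip, get_crawl):
    """Walk the network breadth-first from an entry point, merging each
    peer's view of the others into one dict keyed by IP address.
    `get_crawl` stands in for an HTTP GET of a peer's /crawl endpoint."""
    merged = {}
    visited = set()
    todo = deque([entry_ip])
    while todo:
        ip = todo.popleft()
        if ip in visited:
            continue
        visited.add(ip)
        for peer_ip, data in get_crawl(ip).items():
            # Later reports fill in fields earlier peers didn't know.
            merged.setdefault(peer_ip, {}).update(data)
            if peer_ip not in visited:
                todo.append(peer_ip)
    return merged
```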

ebedke - crawl pages to check what is for lunch today

  •    Python

There are two hard things in computer science: cache invalidation and deciding where to eat. Ebédke is a Flask frontend and a web crawler that collects the daily menu from restaurant pages. The collected menus are cached in Redis and can be viewed as an HTML page or in JSON format.
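The crawl-then-cache pattern behind ebédke can be sketched in a few lines; Redis is replaced by a plain dict here, and all names are illustrative rather than ebédke's actual API:

```python
import json
import time

def get_menu(restaurant, scrape, cache, ttl=3600):
    """Serve the daily menu from a cache, re-crawling the restaurant
    page only when the cached entry has expired. A dict stands in for
    the Redis cache the real project uses."""
    entry = cache.get(restaurant)
    now = time.time()
    if entry is None or now - entry["fetched_at"] > ttl:
        entry = {"menu": scrape(restaurant), "fetched_at": now}
        cache[restaurant] = entry
    return json.dumps({"restaurant": restaurant, "menu": entry["menu"]})
```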