Photon - Incredibly fast crawler designed for recon.

  •    Python

The extracted information is saved in an organized manner or can be exported as JSON. You can control the timeout and delay, add seeds, exclude URLs matching a regex pattern, and more. The extensive range of options provided by Photon lets you crawl the web exactly the way you want.
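To make the options above concrete, here is a minimal, generic crawl loop in Python that wires together a timeout, a delay, a regex exclusion pattern, and JSON export. It is only a sketch of the idea, not Photon's implementation, and all names in it are illustrative.

    import json
    import re
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seeds, exclude_pattern=None, timeout=5, delay=1, max_pages=50):
        """Generic crawl loop: timeout, delay, regex exclusion, JSON-friendly output."""
        exclude = re.compile(exclude_pattern) if exclude_pattern else None
        queue, seen, results = list(seeds), set(seeds), []
        while queue and len(results) < max_pages:
            url = queue.pop(0)
            if exclude and exclude.search(url):
                continue  # skip URLs matching the exclusion regex
            try:
                resp = requests.get(url, timeout=timeout)
            except requests.RequestException:
                continue
            results.append({"url": url, "status": resp.status_code})
            for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(delay)  # polite delay between requests
        return results

    if __name__ == "__main__":
        data = crawl(["https://example.com/"], exclude_pattern=r"\.pdf$")
        print(json.dumps(data, indent=2))  # export the results as JSON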

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ. The project's README includes a high-level architecture diagram demonstrating how Crawler works.
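Crawler itself is Elixir and builds on OPQ; purely as a sketch of the worker-pool-plus-rate-limiting idea (not the project's actual code), the pattern looks roughly like this in Python:

    import queue
    import threading
    import time

    import requests

    class RateLimiter:
        """Allow at most one acquire() per `interval` seconds across all workers."""
        def __init__(self, interval):
            self.interval = interval
            self.lock = threading.Lock()
            self.next_time = 0.0

        def acquire(self):
            with self.lock:
                now = time.monotonic()
                wait = max(0.0, self.next_time - now)
                self.next_time = max(now, self.next_time) + self.interval
            if wait:
                time.sleep(wait)

    def worker(jobs, limiter):
        while True:
            url = jobs.get()
            if url is None:        # poison pill: shut this worker down
                jobs.task_done()
                return
            limiter.acquire()      # respect the shared rate limit
            try:
                print(url, requests.get(url, timeout=5).status_code)
            except requests.RequestException as exc:
                print(url, "failed:", exc)
            jobs.task_done()

    jobs = queue.Queue()
    limiter = RateLimiter(interval=0.5)   # at most ~2 requests per second overall
    pool = [threading.Thread(target=worker, args=(jobs, limiter)) for _ in range(4)]
    for t in pool:
        t.start()
    for url in ["https://example.com/", "https://example.org/"]:
        jobs.put(url)
    jobs.join()                           # wait until all URLs are processed
    for _ in pool:
        jobs.put(None)
    for t in pool:
        t.join()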

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. It also provides a dashboard for all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.
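grab-site delegates the actual crawling and WARC writing to wpull; just to illustrate what writing a WARC record involves, here is a small Python sketch using the warcio library (an assumption made for this example, not what grab-site uses internally):

    from io import BytesIO

    import requests
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    url = "https://example.com/"
    resp = requests.get(url, timeout=10)

    with open("example.warc.gz", "wb") as output:
        writer = WARCWriter(output, gzip=True)
        # Rebuild minimal HTTP headers for the archived response record.
        http_headers = StatusAndHeaders(
            "%d %s" % (resp.status_code, resp.reason),
            [("Content-Type", resp.headers.get("Content-Type", "text/html"))],
            protocol="HTTP/1.1",
        )
        record = writer.create_warc_record(
            url, "response", payload=BytesIO(resp.content), http_headers=http_headers
        )
        writer.write_record(record)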

dhtspider - BitTorrent DHT network spider

  •    Javascript

A BitTorrent DHT network infohash spider, built for engiy.com (a BitTorrent resource search engine).

spider2 - A 2nd generation spider to crawl any article site, automatically reading the title and article.

  •    Javascript

A 2nd-generation spider that crawls any article site, automatically extracting the title and content. In my case, the spider handles about 700 thousand documents per day (22 million per month); the maximum crawling speed is 450 pages per minute and the average is 80 per minute; memory cost is about 200 megabytes per spider kernel; and the accuracy is about 90%. The remaining 10% can be fixed by customizing Score Rules or Selectors. It is better than any other readability module.

spider.py - [Reference Only] An asynchronous, multiprocessed, Python-based spider framework.

  •    Python

An asynchronous, multiprocessed Python spider framework. The spider is separated into two parts: the actual engine and the extractors. The engine submits the requests and handles all of the processes and connections. The extractors are functions that are registered to be called after a page has been loaded and parsed.
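The project's real interface lives in the repository; a minimal sketch of the engine/extractor registration pattern described above, with hypothetical names, could look like this:

    import requests
    from bs4 import BeautifulSoup

    # Extractor callbacks, called after each page has been loaded and parsed.
    EXTRACTORS = []

    def extractor(func):
        """Decorator that registers a function to run on every parsed page."""
        EXTRACTORS.append(func)
        return func

    @extractor
    def collect_title(url, soup):
        print(url, "->", soup.title.string if soup.title else None)

    @extractor
    def count_links(url, soup):
        print(url, "has", len(soup.find_all("a", href=True)), "links")

    def engine(urls):
        """The engine submits requests; extractors only see the parsed result."""
        for url in urls:
            resp = requests.get(url, timeout=5)
            soup = BeautifulSoup(resp.text, "html.parser")
            for func in EXTRACTORS:
                func(url, soup)

    engine(["https://example.com/"])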

robotstxt - robots.txt file parsing and checking for R

  •    R

Provides functions to download and parse ‘robots.txt’ files. Ultimately the package makes it easy to check whether bots (spiders, crawlers, scrapers, …) are allowed to access specific resources on a domain.
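The package itself is R; for readers who only want the concept, the same allowed/disallowed check can be expressed with Python's standard-library urllib.robotparser (shown purely as an analogy, not the R API):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # download and parse the robots.txt file

    # Is this bot allowed to fetch a specific resource on the domain?
    print(rp.can_fetch("my-crawler", "https://www.example.com/private/page.html"))
    print(rp.crawl_delay("my-crawler"))  # Crawl-delay directive, if any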

scrap.js - Scraping tool for Node.js

  •    Javascript

The included examples show how to use jQuery to traverse the page, how to download binary files, and how to download the page as a string and use regular expressions (with jsMatch) to extract meaningful parts.
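scrap.js is a Node.js tool; as a generic illustration of the "download the page as a string and extract parts with regular expressions" pattern it demonstrates (not scrap.js's own API), in Python:

    import re

    import requests

    html = requests.get("https://example.com/", timeout=5).text  # the page as a string

    # Pull out the <title> and all href attributes with regular expressions.
    title = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    links = re.findall(r'href="([^"]+)"', html)

    print(title.group(1).strip() if title else "no title")
    print(len(links), "links found")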

hacker-news-digest - :newspaper: A responsive interface for Hacker News with summaries and illustrations

  •    Python

This service extracts summaries and illustrations from Hacker News articles for people who want to get the most out of Hacker News while cutting down the time spent deciding which articles to read and which to skip.

domain_hunter - A Burp Suite extender that searches for subdomains and similar domains from the sitemap and gets related domains from the certificate

  •    Java

A Burp Suite extender that searches for subdomains, similar domains, and related domains from the sitemap. Sometimes similar and related domains give you a surprise ^_^, which is why I care about them. 2017-07-28: Added a function to crawl all known subdomains; fixed some bugs.
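domain_hunter is a Java Burp extender; to show where the "related domains from the certificate" idea comes from (not the extender's own code), here is a short Python sketch that lists the DNS names in a server's TLS certificate:

    import socket
    import ssl

    def cert_domains(host, port=443):
        """Return the DNS names listed in the server certificate's subjectAltName."""
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        return [value for key, value in cert.get("subjectAltName", ()) if key == "DNS"]

    print(cert_domains("www.python.org"))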

input-field-finder - Spiders given URLs for input fields.

  •    Go

Spiders the domain of a single URL or a set of URLs and prints out all <input> elements found on the given domain and scheme (http/https). Input fields are the most common vector/sink for web application vulnerabilities. I wrote this tool to help automate the reconnaissance phase when testing web applications for security vulnerabilities.
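The tool itself is written in Go; conceptually, collecting <input> elements from a fetched page boils down to something like this Python sketch (illustrative only, not the tool's code):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/login", timeout=5)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Print every <input> element: a common vector/sink for web vulnerabilities.
    for field in soup.find_all("input"):
        print(field.get("type", "text"), field.get("name"), field.get("value"))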

scrala - :whale: :coffee: :spider: Scala crawler(spider) framework, inspired by scrapy.

  •    Scala

scrala is a web crawling framework for Scala, inspired by scrapy. After building, you will find the jar in ./target/scala-<version>/.

jd_product_spider - A product crawler service for JD.com

  •    Python

A diagram is missing here and will be added later... All of JD.com's product categories are organized in three levels.

spoon - A package for building specific proxy pools for different sites.

  •    Python

Spoon is a library for building a distributed proxy pool for each site you assign. It runs only on Python 3. Simply run pip install spoonproxy, or clone the repo and add it to your PYTHONPATH.
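Spoon's own API is documented in its repository; purely to illustrate the proxy-pool idea of rotating requests through a set of proxies (the addresses below are hypothetical), a generic Python sketch:

    import itertools

    import requests

    # A hypothetical pool of proxies; in practice these would come from a
    # proxy-pool service such as the one Spoon builds.
    PROXIES = [
        "http://127.0.0.1:8001",
        "http://127.0.0.1:8002",
        "http://127.0.0.1:8003",
    ]
    rotation = itertools.cycle(PROXIES)

    def fetch(url):
        proxy = next(rotation)  # round-robin through the pool
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)

    print(fetch("http://example.com/").status_code)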