
headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests for HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the website is built on a modern frontend framework such as AngularJS, React, or Vue.js. Note: headless-chrome-crawler bundles Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
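
The "empty body" problem above can be shown without any crawler at all. This dependency-free sketch (the page strings and the `visibleText` helper are illustrative, not part of the library) contrasts what a request-only crawler captures from a server-rendered page versus an SPA shell:

```javascript
// Illustration only: why a plain HTTP fetch can yield an "empty" body.
// A server-rendered page carries its content in the HTML itself, while an
// SPA shell ships an empty mount node that client-side JavaScript fills later.
const serverRendered = '<html><body><h1>Results</h1><p>42 items found</p></body></html>';
const spaShell = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>';

// Strip scripts and tags to see what a request-only crawler actually captures.
function visibleText(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop script tags and their bodies
    .replace(/<[^>]+>/g, ' ')                   // drop remaining markup
    .replace(/\s+/g, ' ')
    .trim();
}

const fromServer = visibleText(serverRendered); // "Results 42 items found"
const fromShell = visibleText(spaShell);        // "" — nothing until JS runs
```

A headless-Chrome crawler sidesteps this by executing the page's JavaScript before reading the DOM, at the cost of running a full browser.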

huntsman - Super configurable async web spider

  •    Javascript

Huntsman takes one or more 'seed' URLs via the spider.queue.add() method. Once the process is kicked off with spider.start(), it takes care of extracting links from each page and following only the pages you want.
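
The seed-queue idea can be sketched in plain Node. Only `spider.queue.add()` and `spider.start()` come from the description above; everything else here (the factory, the stubbed fetcher, the follow pattern) is an assumption made so the example runs standalone, not huntsman's real internals:

```javascript
// Minimal sketch of a seed-queue spider: seed URLs go into a queue, pages are
// "fetched" via a stub, and only links matching a pattern are followed.
function makeSpider(fetchPage, followPattern) {
  const queued = new Set();
  const queue = [];
  const visited = [];
  return {
    queue: {
      add(url) {
        if (!queued.has(url)) { queued.add(url); queue.push(url); }
      },
    },
    start() {
      while (queue.length > 0) {
        const url = queue.shift();
        visited.push(url);
        for (const link of fetchPage(url)) {
          if (followPattern.test(link)) this.queue.add(link);
        }
      }
      return visited;
    },
  };
}

// Stub "site": each URL maps to the links found on that page.
const site = {
  'https://example.com/': ['https://example.com/a', 'https://other.org/x'],
  'https://example.com/a': ['https://example.com/'],
};
const spider = makeSpider(url => site[url] || [], /^https:\/\/example\.com\//);
spider.queue.add('https://example.com/');
const pages = spider.start(); // follows only example.com links, each once
```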

node-Tor - Javascript implementation of the Tor (or Tor like) anonymizer project (The Onion Router)

  •    Javascript

For a quick look, see the demo video on Peersm: download and stream anonymously inside your browser over a serverless, anonymous P2P network compatible with torrents. Check out torrent-live for a more general presentation and to get the dynamic blocklist.

Revenant - A high level PhantomJS headless browser in Node.js ideal for task automation

  •    Javascript

A headless browser powered by PhantomJS, running in Node.js and based on the PhantomJS-Node bridge. This library aims to abstract many of the simple functions one would use while testing or scraping a web page. Instead of running page.evaluate(...) and writing the JavaScript functions for each task by hand, these tasks are abstracted for the user.
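
The abstraction pattern described above can be sketched with a stub. The helper names (`getTitle`, `getText`) and the in-memory "document" are hypothetical, standing in for Revenant's real API; the point is how raw `page.evaluate(...)` calls get wrapped into named tasks:

```javascript
// Stub standing in for a PhantomJS page: evaluate(fn, ...args) runs fn
// against an in-memory "document" instead of a real rendered page.
const stubPage = {
  evaluate(fn, ...args) {
    const document = {
      title: 'Example Domain',
      texts: { h1: 'Example Domain', p: 'This domain is for use in examples.' },
    };
    return fn(document, ...args);
  },
};

// Hypothetical helpers wrapping evaluate, in the spirit of the library:
// the caller asks for a named task instead of writing the evaluate body.
function getTitle(page) {
  return page.evaluate(doc => doc.title);
}
function getText(page, selector) {
  return page.evaluate((doc, sel) => doc.texts[sel] || null, selector);
}

const title = getTitle(stubPage);        // "Example Domain"
const heading = getText(stubPage, 'h1'); // "Example Domain"
```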

dhtspider - Bittorrent dht network spider

  •    Javascript

A BitTorrent DHT network infohash spider, built for engiy.com (a BitTorrent resource search engine).

lightcrawler - Crawl a website and run it through Google lighthouse

  •    Javascript

Crawl a website and run it through Google lighthouse

ghcrawler - Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

  •    Javascript

A robust GitHub API crawler that walks a queue of GitHub entities retrieving and storing their contents.
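
The queue-walking loop described above can be sketched with a stubbed API. The entity IDs, the `stubApi` shape, and the in-memory store are assumptions made for a runnable example, not ghcrawler's real interfaces:

```javascript
// Stub GitHub API: each entity lists the child entities it links to
// (an org's repos, a repo's commits).
const stubApi = {
  'org:acme': { type: 'org', repos: ['repo:acme/site'] },
  'repo:acme/site': { type: 'repo', commits: ['commit:abc123'] },
  'commit:abc123': { type: 'commit' },
};

function crawl(seed) {
  const queue = [seed];
  const store = {};
  while (queue.length > 0) {
    const id = queue.shift();
    if (store[id]) continue;            // already retrieved, skip
    const entity = stubApi[id];
    store[id] = entity;                 // "retrieve and store their contents"
    // Queue every discovered child entity for a later pass.
    for (const child of [...(entity.repos || []), ...(entity.commits || [])]) {
      queue.push(child);
    }
  }
  return store;
}

const store = crawl('org:acme'); // walks org -> repo -> commit
```

In the real crawler the queue survives restarts and the fetches hit the GitHub REST API with rate limiting, but the discover-queue-retrieve-store cycle is the same shape.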

norch-fetch - Fetch pure HTML from a webserver and save it to disk

  •    Javascript

Fetch pure HTML from a webserver and save it to disk

bjut_crawler - BJUT secrets~

  •    Javascript

Crawls student information from BJUT.

recrawler - Remote web content crawler done right.

  •    Javascript

Remote web content crawler done right. Sometimes I want to grab some nice images from a URL such as http://bbs.005.tv/thread-492392-1-1.html, so I made this little program that combines node-fetch and cheerio to do exactly that.
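
The core step — pulling image URLs out of fetched HTML — can be shown without the library's dependencies. This sketch uses a regex in place of cheerio and an inline HTML string in place of node-fetch, so it runs standalone:

```javascript
// Extract the src of every <img> tag from an HTML string, in document order.
// (cheerio would do this more robustly; a regex keeps the sketch dependency-free.)
function extractImageUrls(html) {
  const urls = [];
  const re = /<img\b[^>]*\bsrc=["']([^"']+)["']/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    urls.push(m[1]);
  }
  return urls;
}

const page = `
  <div class="post">
    <img src="https://img.example.com/a.jpg" alt="">
    <p>text</p>
    <img class="lazy" src="https://img.example.com/b.png">
  </div>`;
const images = extractImageUrls(page); // two URLs, in document order
```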

taki - Take a snapshot of any website.

  •    Javascript

Built on top of Google's Puppeteer; a jsdom/chromy version is also available separately. When configured accordingly, call window.iamready() instead of window.snapshot() in your app.

jDistiller - A page scraping DSL for extracting structured information from unstructured XHTML, built on Node

  •    Javascript

Over my past couple of years in the industry, there have been several times where I needed to scrape structured information from (relatively) unstructured XHTML websites. A closure can optionally be provided as the third parameter to the set() method.
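
A set()-based scraping DSL of this kind can be sketched as follows. Only set() and its optional third-parameter closure come from the description above; the factory, the distill() step, and the stubbed selector lookup are assumptions made so the example runs standalone:

```javascript
// Tiny sketch of a set()-style scraping DSL: each set() maps an output key
// to a selector, and the optional closure post-processes the extracted value.
function makeDistiller(lookup) {
  const fields = [];
  return {
    set(key, selector, transform) {
      fields.push({ key, selector, transform });
      return this; // chainable, as scraping DSLs typically are
    },
    distill() {
      const out = {};
      for (const { key, selector, transform } of fields) {
        const raw = lookup(selector);
        out[key] = transform ? transform(raw) : raw;
      }
      return out;
    },
  };
}

// Stub selector lookup standing in for real XHTML parsing.
const lookup = sel => ({ 'h1.title': '  Hello World  ', '.views': '1,024' }[sel]);
const result = makeDistiller(lookup)
  .set('title', 'h1.title', s => s.trim())
  .set('views', '.views', s => parseInt(s.replace(',', ''), 10))
  .distill(); // { title: 'Hello World', views: 1024 }
```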

routers-news - A crawler for various popular tech news sources

  •    Javascript

Routers is a collection of web crawlers for various popular technology news sources. It exposes a command-line interface to these crawlers, allowing the distinguishing tech-news enthusiast to avoid leaving the comfort of their terminal.

wikifetch - Uses jQuery to return a structured JSON representation of a Wikipedia article.

  •    Javascript

For some NLP research I'm currently doing, I was interested in parsing structured information from Wikipedia articles.

PainlessCrawler - (WIP) A painless Node.js web crawler that simply works

  •    Javascript

Working with some of the top Node.js crawlers on GitHub has led to much frustration with getting something that simply works, and I ended up spending many hours playing around and figuring out how they were supposed to work. As such, I wanted to build a first-and-foremost painless crawler that just works, with clear documentation, allowing the user to focus on more important tasks.

rippled-network-crawler - Crawls all nodes in rippled network

  •    Javascript

This crawls the Ripple network by making requests to the /crawl endpoint of each peer it can connect to, starting from an entry point. Some peers may know, and publish (perhaps errantly), the IP associated with a peer, while others don't. We merge the points of view of each peer, collecting a dict of data keyed by IP address. This maps out the connections between all rippled servers (not necessarily UNLs) which, for the most part, don't even participate in consensus, or at least don't have any say in influencing the outcome of a transaction on mainnet.
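
The merging step described above can be sketched with stubbed /crawl responses. The field names and response shape here are assumptions for illustration, not rippled's actual /crawl schema; the point is folding many partial per-peer views into one dict keyed by IP:

```javascript
// Each reachable peer reports its own (possibly partial) view of other peers.
const peerViews = [
  // What peer A's /crawl endpoint reported:
  [{ ip: '10.0.0.1', version: 'rippled-1.9.4' },
   { ip: '10.0.0.2' }],                               // A doesn't know 2's version
  // What peer B reported:
  [{ ip: '10.0.0.2', version: 'rippled-1.9.1' },
   { ip: '10.0.0.3', version: 'rippled-1.9.4' }],
];

function mergeViews(views) {
  const byIp = {};
  for (const view of views) {
    for (const peer of view) {
      // Merge fields so later views fill gaps left by earlier ones.
      byIp[peer.ip] = { ...byIp[peer.ip], ...peer };
    }
  }
  return byIp;
}

const network = mergeViews(peerViews); // 3 unique IPs, versions merged in
```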