
headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React, and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
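The "empty body" problem can be illustrated with a self-contained sketch (not part of headless-chrome-crawler; the helper name is hypothetical): a request-based crawler fetching a React or Vue app receives only the shell HTML, whose mount node is empty until client-side JavaScript runs.

```javascript
// Hypothetical helper: detect the kind of "empty body" a plain HTTP
// crawler captures from a JavaScript-rendered site.
function looksLikeEmptyShell(html) {
  // An SPA shell typically ships an empty mount node such as
  // <div id="root"></div> or <div id="app"></div>.
  return /<div id="(root|app)">\s*<\/div>/.test(html);
}

// What a request-based crawler sees before client-side rendering:
const spaShell = '<html><body><div id="root"></div></body></html>';
// What a headless browser sees after the framework has rendered:
const rendered = '<html><body><div id="root"><h1>Hello</h1></div></body></html>';

console.log(looksLikeEmptyShell(spaShell));  // true
console.log(looksLikeEmptyShell(rendered));  // false
```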

huntsman - Super configurable async web spider

  •    Javascript

Huntsman takes one or more 'seed' URLs via the spider.queue.add() method. Once the process is kicked off with spider.start(), it takes care of extracting links from each page and following only the pages we want.
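The queue-then-follow flow described above can be sketched in plain JavaScript (an illustration of the idea, not huntsman's actual internals; all names here are hypothetical):

```javascript
// Minimal queue-and-filter sketch: seed URLs go into a queue, links are
// extracted from each fetched page, and only matching URLs are followed.
function extractLinks(html) {
  const links = [];
  const re = /href="([^"]+)"/g;
  let m;
  while ((m = re.exec(html)) !== null) links.push(m[1]);
  return links;
}

function followable(url, pattern) {
  return pattern.test(url); // follow only the pages we want
}

const html = '<a href="/articles/1">one</a> <a href="/login">login</a>';
const next = extractLinks(html).filter(u => followable(u, /^\/articles\//));
console.log(next); // ["/articles/1"]
```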

node-Tor - Javascript implementation of the Tor (or Tor like) anonymizer project (The Onion Router)

  •    Javascript

For a quick look, see the demo video on Peersm: download and stream anonymously inside your browser over a serverless, anonymous P2P network compatible with torrents. Check out torrent-live for a more general presentation and to get the dynamic blocklist.

norch-fetch - Fetch pure HTML from a webserver and save it to disk

  •    Javascript

Fetch pure HTML from a webserver and save it to disk

recrawler - Remote web content crawler done right.

  •    Javascript

Remote web content crawler done right. Sometimes I want to grab some nice images from a URL like http://bbs.005.tv/thread-492392-1-1.html, so I made this little program that combines node-fetch and cheerio to do exactly that.
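The node-fetch + cheerio combination can be approximated with a self-contained sketch (a real run would use fetch() for the HTML and cheerio's selectors; this regex stand-in keeps the example runnable on its own):

```javascript
// Stand-in for "fetch the page, then pull out the image URLs".
function imageUrls(html) {
  const urls = [];
  const re = /<img[^>]*\ssrc="([^"]+)"/g;
  let m;
  while ((m = re.exec(html)) !== null) urls.push(m[1]);
  return urls;
}

const page = '<div><img src="a.jpg"><p>text</p><img src="b.png"></div>';
console.log(imageUrls(page)); // ["a.jpg", "b.png"]
```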

iranian-calendar-events - Fetch Iranian calendar events (Jalali, Hijri and Gregorian) from time

  •    Javascript

A simple package to fetch Iranian calendar events (Jalali, Hijri, Gregorian) from the time.ir website. Remember to write a few tests for your code before sending pull requests.

npm-search - 🗿 npm ↔️ Algolia replication tool :skier: :snail: :artificial_satellite:

  •    Javascript

npm ↔️ Algolia replication tool. This is a failure-resilient replication process from the npm registry to an Algolia index. It replicates all npm packages to the index and keeps it up to date.

smeagol - NodeJS crawler

  •    Javascript

Smeagol is a very simple Node.js crawler module where you can create URL patterns to extract different contents from different pages. "pattern_url" defines which pages Smeagol will scrape. "id" is the identifier for the result group in Smeagol's results. "each_item" is a CSS selector; Smeagol iterates over this selector on the page and extracts the data defined in "find". "find" is an object with a label and a CSS selector for each piece of information you want to get from each "each_item".
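Putting the fields above together, a pattern might look like this (the URLs and selectors are made up for illustration; consult Smeagol's own docs for the exact option names it accepts at runtime):

```javascript
// Hypothetical Smeagol pattern, following the fields described above.
const pattern = {
  pattern_url: 'https://example.com/articles/*', // which pages to scrape
  id: 'articles',                                // name of the result group
  each_item: 'div.article',                      // CSS selector iterated on the page
  find: {                                        // label -> CSS selector per item
    title: 'h2.title',
    author: 'span.author'
  }
};

console.log(Object.keys(pattern)); // ["pattern_url", "id", "each_item", "find"]
```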

grunt-link-checker - Run node-simple-crawler to discover broken links on your website

  •    Javascript

Run node-simple-crawler to discover broken links on your website. By default, grunt-link-checker finds any broken internal links on the given site, and it also finds broken fragment identifiers by using cheerio to ensure that an element exists with the given identifier. You can configure more options that are available via node-simplecrawler.
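The fragment-identifier check boils down to: a link to page#section is only valid if the target page contains an element with id="section". A self-contained stand-in (the real plugin uses cheerio; the helper name is hypothetical):

```javascript
// Validate a fragment identifier against the target page's HTML.
function fragmentExists(html, fragment) {
  return new RegExp(`id="${fragment}"`).test(html);
}

const target = '<h2 id="install">Install</h2><h2 id="usage">Usage</h2>';
console.log(fragmentExists(target, 'usage'));   // true
console.log(fragmentExists(target, 'missing')); // false
```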

snapshooter - Simple crawler for Single Page Applications

  •    CoffeeScript

Simple crawler for Single Page Applications. Snapshooter will load a URL, wait for the JavaScript to render, and save the result as plain HTML.

salmonjs - [WIP] Web Crawler in Node.js to spider dynamically whole websites.

  •    Javascript

Web Crawler in Node.js to dynamically spider whole websites. It helps you map and process entire websites, spidering them and parsing each page in a smart way. It follows all the links and tests the form objects several times. In this way it is possible to check the whole website effectively.

node-bot - Fast and real-time extraction of web pages information (html, text, etc) using node-dom based on given criterias (example : retrieves real-time the price of a product)

  •    Javascript

Real-time extraction of web page information (HTML, text, etc.) based on given criteria. It can be used as a server or an API, with parameters passed in the URL, or directly as an independent Node.js module.

puppeteer-fetchbot - Library and Shell command that provides a simple JSON-API to perform human like interactions and data extractions on any website

  •    TypeScript

FetchBot is a library and shell command that provides a simple JSON API to perform human-like interactions and data extractions on any website; it is built on top of Puppeteer. FetchBot has an event-listener-like system that turns your browser into a bot that knows what to do when the URL changes. The "event" is a URL or regex, and its configuration is executed once the pattern matches the currently opened URL. From there on, it's up to you to configure a friendly bot or a crazy zombie.
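The URL-as-event idea can be sketched as a plain object keyed by URL patterns, with a lookup that fires the matching configuration (the field names and shape here are made up; see the FetchBot docs for the real schema):

```javascript
// Hypothetical sketch of the event-listener-like system: each key is a
// URL pattern, and its configuration runs when the current URL matches.
const bot = {
  'https://example.com/login': { type: { '#user': 'alice' }, click: '#submit' },
  'https://example.com/items/.*': { extract: { title: 'h1' } }
};

function configFor(url, config) {
  const key = Object.keys(config).find(p => new RegExp(`^${p}$`).test(url));
  return key ? config[key] : null;
}

console.log(configFor('https://example.com/items/42', bot)); // { extract: { title: 'h1' } }
console.log(configFor('https://other.com/', bot));           // null
```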

spider-detector - A tiny node module to detect spiders/crawlers quickly and comes with optional middleware for ExpressJS

  •    Javascript

It might be useful when you have a single-page app but want to deliver static pages for spiders. Well, I wanted one which does not use readFileSync and comes with optional middleware. Furthermore, some hackers do not classify Googlebot as a spider anymore, which sometimes poses a problem; see the next question.
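Spider detection typically comes down to matching the User-Agent header against known crawler tokens. A minimal stand-in (not the module's real detection list or API; the middleware shape follows the standard Express req/res/next convention):

```javascript
// Minimal spider check against a few well-known crawler tokens.
const SPIDER_RE = /bot|crawler|spider/i;

function isSpider(userAgent) {
  return SPIDER_RE.test(userAgent || '');
}

// Hypothetical Express-style middleware using the check: tags the
// request so later handlers can serve static pages to spiders.
function spiderMiddleware(req, res, next) {
  req.isSpider = isSpider(req.headers['user-agent']);
  next();
}

console.log(isSpider('Mozilla/5.0 (compatible; Googlebot/2.1)')); // true
console.log(isSpider('Mozilla/5.0 (Windows NT 10.0) Chrome/120')); // false
```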

npm-license-crawler - Analyzes license information for multiple node

  •    Javascript

NPM License Crawler is a wrapper around license-checker that analyzes several node packages (package.json files) as part of your software project. This way, it is possible to create a list of third-party licenses for your software project in one go. File paths containing ".git" or "node_modules" are ignored at the stage where package.json files are matched to provide the entry points for calling license-checker. If you like npm-license-crawler, please consider ★ starring the project on GitHub. Contributions to the project are welcome: simply fork the project and create a pull request with your contribution.
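The entry-point filter described above can be sketched in a few lines (the function name is hypothetical, not the crawler's actual implementation):

```javascript
// Sketch of the path filter: package.json files under ".git" or
// "node_modules" are skipped when collecting entry points.
function isEntryPoint(filePath) {
  return filePath.endsWith('package.json') &&
    !filePath.includes('.git') &&
    !filePath.includes('node_modules');
}

const paths = [
  'project/package.json',
  'project/node_modules/left-pad/package.json',
  'project/.git/hooks/package.json'
];
console.log(paths.filter(isEntryPoint)); // ["project/package.json"]
```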

web-auto-extractor - Automatically extracts structured information from webpages

  •    Javascript

Parse semantically structured information from any HTML webpage. Many websites mark up their webpages with Schema.org vocabularies for better SEO. This library helps you parse that information to JSON.
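One common form of Schema.org markup is a JSON-LD script tag, and extracting it can be shown in a self-contained sketch (web-auto-extractor also handles formats such as microdata; this regex version covers only the JSON-LD case for illustration):

```javascript
// Pull Schema.org JSON-LD blocks out of a page and parse them to JSON.
function extractJsonLd(html) {
  const re = /<script type="application\/ld\+json">([\s\S]*?)<\/script>/g;
  const out = [];
  let m;
  while ((m = re.exec(html)) !== null) out.push(JSON.parse(m[1]));
  return out;
}

const page = '<script type="application/ld+json">' +
  '{"@type":"Article","headline":"Hello"}</script>';
console.log(extractJsonLd(page)[0].headline); // "Hello"
```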

crawlerr - A simple and fully customizable web crawler/spider for Node

  •    Javascript

crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It allows you to crawl only specific URLs, matched by wildcards, and it uses a Bloom filter for caching, for a browser-like feeling. Creating a new Crawlerr instance for a specific website takes custom options, and all routes are resolved relative to the base URL.
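The Bloom-filter caching mentioned above can be illustrated with a toy implementation (crawlerr uses a real Bloom filter internally; this sketch just shows the principle: fast membership tests with no false negatives, at the cost of occasional false positives):

```javascript
// Tiny Bloom filter sketch for "have we seen this URL before?" checks.
class TinyBloom {
  constructor(size = 1024) {
    this.size = size;
    this.bits = new Uint8Array(size);
  }
  // Two simple rolling hashes; real filters use stronger, independent hashes.
  hashes(str) {
    let h1 = 17, h2 = 31;
    for (const c of str) {
      h1 = (h1 * 31 + c.charCodeAt(0)) % this.size;
      h2 = (h2 * 37 + c.charCodeAt(0)) % this.size;
    }
    return [h1, h2];
  }
  add(url) { for (const h of this.hashes(url)) this.bits[h] = 1; }
  mightContain(url) { return this.hashes(url).every(h => this.bits[h] === 1); }
}

const seen = new TinyBloom();
seen.add('https://example.com/a');
console.log(seen.mightContain('https://example.com/a')); // true
console.log(seen.mightContain('https://example.com/b')); // false (here; false positives are possible in general)
```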