Displaying 1 to 4 of 4 results

huntsman - Super configurable async web spider

  •    Javascript

Huntsman takes one or more 'seed' URLs via the spider.queue.add() method. Once the process is kicked off with spider.start(), it takes care of extracting links from each page and following only the pages we want.
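
A minimal sketch of that flow; only queue.add() and start() are confirmed by the description above, while the require('huntsman') entry point, the huntsman.spider() factory, and the pattern-matched handler are assumptions:

    const huntsman = require('huntsman');       // assumed entry point
    const spider = huntsman.spider();           // assumed factory

    // Follow only the pages we want; a pattern-matched handler is assumed here
    spider.on(/\/articles\//, (err, res) => {
      if (err) return console.error(err);
      console.log('crawled', res.uri);          // response shape is an assumption
    });

    spider.queue.add('http://example.com/');    // seed URL (from the description)
    spider.start();                             // kick off the crawl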

node-readability - Scrape/Crawl article from any site automatically

  •    Javascript

In my case, the spider handles about 1,500k documents per day; the maximum crawling speed is about 1.2k/minute and the average is 1k/minute, memory cost is about 200 MB per spider kernel, and accuracy is about 90%. The remaining 10% can be fixed by customizing Score Rules or Selectors. It's better than any other readability module.
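
A minimal usage sketch, assuming the commonly documented read(url, callback) entry point with article.title, article.content, and article.close(); the exact signature is not confirmed by the excerpt above:

    const read = require('node-readability');

    read('http://example.com/some-article', (err, article, meta) => {
      if (err) throw err;
      console.log(article.title);    // extracted article title
      console.log(article.content);  // cleaned article HTML
      article.close();               // release the underlying DOM when done
    });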

spider-detector - A tiny node module to detect spiders/crawlers quickly, with optional middleware for ExpressJS

  •    Javascript

It might be useful when you have a single-page app but want to deliver static pages to spiders. I wanted one which does not use readFileSync and comes with optional middleware. Furthermore, some packages do not classify Googlebot as a spider anymore, which sometimes poses a problem; see the next question.
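
A minimal sketch of how such a module might be wired into Express to serve static pages to crawlers; spiderDetector.middleware() and the req.isSpider flag are assumed names, not confirmed by the excerpt:

    const express = require('express');
    const spiderDetector = require('spider-detector'); // assumed entry point

    const app = express();

    // Assumed middleware() factory: tags requests coming from known crawlers
    app.use(spiderDetector.middleware());

    app.get('/', (req, res) => {
      if (req.isSpider) {
        // assumed flag set by the middleware: serve a pre-rendered static page
        res.sendFile(__dirname + '/static/index.html');
      } else {
        res.sendFile(__dirname + '/app/index.html'); // SPA shell for regular users
      }
    });

    app.listen(3000);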

crawlerr - A simple and fully customizable web crawler/spider for Node

  •    Javascript

crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only specific URLs, matched with wildcards, and it uses a Bloom filter for caching, giving a browser-like feel. You create a new Crawlerr instance for a specific website with custom options, and all routes are resolved against that base URL.
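
A minimal sketch of that setup, assuming the module is called as crawlerr(base) with optional settings, and exposes a wildcard route handler plus start(); the when()/then() shape and req.url are assumptions based on the description, not a confirmed API:

    const crawlerr = require('crawlerr');

    // New instance for a specific website; custom options can be passed as a
    // second argument, and all routes below resolve against this base URL
    const spider = crawlerr('http://example.com/');

    // Wildcard route: only URLs matching this pattern are crawled (shape assumed)
    spider.when('/blog/[slug]').then(({ req, res }) => {
      console.log('visited', req.url);
    });

    spider.start();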