Displaying 1 to 7 from 7 results

huntsman - Super configurable async web spider

  •    Javascript

Huntsman takes one or more 'seed' urls with the spider.queue.add() method. Once the process is kicked off with spider.start(), it will take care of extracting links from the page and following only the pages we want.

node-readability - Scrape/Crawl article from any site automatically

  •    Javascript

In my case, the speed of spider is about 1500k documents per day, and the maximize crawling speed is 1.2k /minute, avg 1k /minute, the memory cost are about 200 MB on each spider kernel, and the accuracy is about 90%, the rest 10% can be fixed by customizing Score Rules or Selectors. it's better than any other readability modules.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.




snap - Creates a static snapshot of a website. Sort of like wget's mirror mode, but with nice urls

  •    Javascript

Like wget -r <url> but specifically designed to support "pretty" URLs. With wget, a URL pointing to /foo would result in /foo.html, but this means the URL has now changed.With snap, it will create the directory /foo and save the file to /foo/index.html so that the URL /foo still works.

spider2 - A 2nd generation spider to crawl any article site, automatic read title and article.

  •    Javascript

A 2nd generation spider to crawl any article site, automatic reading title and content.In my case, the speed of spider is about 700 thousands documents per day, 22 million per month, and the maximize crawling speed is 450 per minute, avg 80 per minute, the memory cost are about 200 megabytes on each spider kernel, and the accuracy is about 90%, the rest 10% can be fixed by customizing Score Rules or Selectors. it's better than any other readability modules.