Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast, high-level web crawling and scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
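
For a quick sense of how Scrapy is used, a minimal spider might look like the sketch below; the quotes.toscrape.com URL and the CSS selectors are illustrative placeholders, not part of Scrapy itself:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # Every spider needs a unique name and one or more start URLs.
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # Extract structured data from each quote block on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the pagination link and parse it with this same callback.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)

Saved as quotes_spider.py, this could be run with "scrapy runspider quotes_spider.py -o quotes.json" to write the extracted items to a JSON file.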

Gigablast - Web and Enterprise search engine in C++

  •    C++

Gigablast is one of the four remaining search engines in the United States that maintains its own searchable index of over a billion pages. It is scalable to thousands of servers and has scaled to over 12 billion web pages on over 200 servers. Features include a distributed web crawler, document conversion, automated data corruption detection and repair, clustering of results from the same site, synonym search, a spell checker, and more.

Pavuk

  •    C

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

maman - Rust Web Crawler saving pages on Redis

  •    Rust

Maman is a Rust web crawler that saves pages to Redis. LIMIT must be an integer; the default is 0, which means no limit.

gopa - [WIP] GOPA, a spider written in Golang, for Elasticsearch

  •    Go

GOPA, a spider written in Go. First of all, get it; there are two options: download the pre-built package or compile it yourself.

awesome-web-scraper - A collection of awesome web scrapers and crawlers.

  •    

A collection of awesome web scrapers and crawlers. Please read the Contribution Guidelines before submitting your suggestion.

supercrawler - A web crawler

  •    JavaScript

Supercrawler is a Node.js web crawler. It is designed to be highly configurable and easy to use. When Supercrawler successfully crawls a page (which could be an image, a text document or any other file), it will fire your custom content-type handlers. Define your own custom handlers to parse pages, save data and do anything else you need.
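
The basic flow looks roughly like the sketch below, modelled on the project's README; the constructor options and the htmlLinkParser helper are recalled from that README and should be treated as assumptions rather than a definitive reference:

    var supercrawler = require("supercrawler");

    // Configure politeness: delay between requests and a concurrency cap.
    var crawler = new supercrawler.Crawler({
      interval: 1000,
      concurrentRequestsLimit: 5
    });

    // Built-in handler: discover links in HTML pages and queue them for crawling.
    crawler.addHandler("text/html", supercrawler.handlers.htmlLinkParser({
      hostnames: ["example.com"]
    }));

    // Custom content-type handler: fired for every successfully crawled HTML page.
    crawler.addHandler("text/html", function (context) {
      console.log("Crawled", context.url);
    });

    // Seed the URL list and start crawling.
    crawler.getUrlList()
      .insertIfNotExists(new supercrawler.Url("https://example.com/"))
      .then(function () {
        return crawler.start();
      });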

crawlerr - A simple and fully customizable web crawler/spider for Node

  •    JavaScript

crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only specific URLs, matched with wildcard routes, uses a Bloom filter for caching, and aims for a browser-like feel. You create a new crawlerr instance for a specific website with custom options, and all routes are resolved relative to that base URL.
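
As a rough illustration of that pattern only, the sketch below shows an instance bound to one base URL with a wildcard route; the method names (crawlerr(), get(), start()), the [id] wildcard syntax, and the req/res callback shape are assumptions made for illustration, not a verified reference to the crawlerr API:

    const crawlerr = require("crawlerr");

    // Create an instance for a specific website; routes resolve against this base.
    const spider = crawlerr("http://example.com");

    // Wildcard route: only URLs matching this pattern are crawled and handled.
    spider.get("/post/[id]").on("request", (req, res) => {
      // req/res expose the matched URL parameters and the fetched page (names assumed).
      console.log("Crawled post", req.param("id"));
    });

    spider.start();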