Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
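
To give a flavour of the API, here is a minimal spider along the lines of the official Scrapy tutorial (the quotes.toscrape.com URL and the field names are illustrative example values):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # Run with: scrapy runspider quotes_spider.py -o quotes.json
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured items from the page with CSS selectors.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link and keep crawling.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```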

Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more, and metadata is also extracted from image files. Aperture has a strong focus on semantics; extracted metadata can be mapped to predefined properties.

Gigablast - Web and Enterprise search engine in C++

  •    C++

Gigablast is one of the remaining four search engines in the United States that maintains its own searchable index of over a billion pages. It is scalable to thousands of servers and has scaled to over 12 billion web pages on over 200 servers. It offers a distributed web crawler, document conversion, automated data corruption detection and repair, clustering of results from the same site, synonym search, a spell checker, and more.

Pavuk

  •    C

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make the lives of Enterprise Search integrators and developers easier. It is portable, extensible, and reusable; supports robots.txt; can obtain and manipulate document metadata; is resumable upon failure; and more.

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is a library that developers can leverage to build their own crawlers, and doing so can be pretty straightforward: often, all you have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce.

Soup - Web Scraper in Go, similar to BeautifulSoup

  •    Go

soup is a small web scraper package for Go, with an interface highly similar to that of BeautifulSoup.
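
Since the package models its interface on BeautifulSoup, the Python original is a useful point of reference; a minimal BeautifulSoup snippet (example.com is a placeholder URL) looks like this, and soup exposes analogous Find/FindAll-style calls in Go:

```python
# Minimal BeautifulSoup usage for reference; the Go soup package mirrors this style.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text  # placeholder URL
doc = BeautifulSoup(html, "html.parser")
for link in doc.find_all("a"):                 # find every anchor element
    print(link.get_text(), link.get("href"))   # text content and href attribute
```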

NCrawler

  •    DotNet

A simple and very efficient multithreaded web crawler with pipeline-based processing, written in C#. It contains HTML, Text, PDF, and IFilter document processors as well as language detection (Google), and it is easy to add pipeline steps to extract, use, and alter information.

GrabberX

  •    

GrabberX is a site-mirroring tool. It is used to deal with form/cookie-sealed websites, JavaScript-generated links, and so on. The goal is not performance but a handy tool that can assist the crawling done by other enterprise search engines.

get-image-urls - Scrape image URLs from an HTML website, including CSS background images.

  •    Javascript

Scrape image URLs from an HTML website. It uses PhantomJS in the background to get all images, including CSS backgrounds.

scrapy-bench - A CLI for benchmarking Scrapy.

  •    Python

A command-line interface for benchmarking Scrapy that reflects real-world usage. First, download a static snapshot of the website Books to Scrape; this can be done using wget.

algolia-webcrawler - Simple Node worker that crawls sitemaps in order to keep an Algolia index up-to-date

  •    Javascript

A simple Node worker that crawls sitemaps in order to keep an Algolia index up-to-date. It uses simple CSS selectors to find the actual text content to index.
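
The project itself is a Node.js worker; purely as an illustration of that approach (read a sitemap, fetch each page, select the text to index), a rough Python sketch might look like the following, where SITEMAP_URL and CONTENT_SELECTOR are hypothetical, site-specific values and the resulting records would ultimately be pushed to the index by an Algolia client:

```python
# Conceptual sketch of sitemap-driven indexing, not the project's own code.
import xml.etree.ElementTree as ET
import requests
from bs4 import BeautifulSoup

SITEMAP_URL = "https://example.com/sitemap.xml"  # hypothetical sitemap location
CONTENT_SELECTOR = "article .content"            # hypothetical CSS selector

def records_from_sitemap(sitemap_url, selector):
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    tree = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    for loc in tree.findall(".//sm:loc", ns):
        url = loc.text.strip()
        page = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        node = page.select_one(selector)
        if node is not None:
            # One record per URL; an indexing client would push these upstream.
            yield {"objectID": url, "url": url, "text": node.get_text(" ", strip=True)}

for record in records_from_sitemap(SITEMAP_URL, CONTENT_SELECTOR):
    print(record["url"], len(record["text"]))
```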

WebCrawler - Just a simple web crawler which returns crawled links as IObservable, using Reactive Extensions and async/await

  •    CSharp

Just a simple web crawler which returns crawled links as IObservable, using Reactive Extensions and async/await.
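
The library is C#; as a rough conceptual analogue in Python (not the library's API), the same idea of exposing crawled links as a consumable stream can be sketched with an async generator, here using aiohttp and a deliberately naive regex for link extraction:

```python
# Conceptual sketch: a crawler that streams discovered links to its consumer.
import asyncio
import re
import aiohttp

LINK_RE = re.compile(r'href="(https?://[^"]+)"')  # naive link extraction

async def crawl(start_url, max_pages=10):
    seen, queue = set(), [start_url]
    async with aiohttp.ClientSession() as session:
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                async with session.get(url) as resp:
                    html = await resp.text()
            except aiohttp.ClientError:
                continue
            for link in LINK_RE.findall(html):
                queue.append(link)
                yield link  # each discovered link is pushed to the subscriber

async def main():
    async for link in crawl("https://example.com"):  # placeholder start URL
        print(link)

asyncio.run(main())
```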

js-crawler - Web crawler for Node.js

  •    TypeScript

Web crawler for Node.js; both HTTP and HTTPS are supported. The call to configure is optional; if it is omitted, the default option values will be used.

ChiChew - A live web crawler for the Chinese-Chinese dictionary 《重編國語辭典修訂本》 (Revised Mandarin Chinese Dictionary) published by the Ministry of Education in Taiwan

  •    Python

A live web crawler, with real-time lookups, for the Chinese-Chinese dictionary 《重編國語辭典修訂本》 (Revised Mandarin Chinese Dictionary) published by the Ministry of Education in Taiwan.

maman - Rust Web Crawler saving pages on Redis

  •    Rust

Maman is a Rust web crawler that saves pages to Redis. LIMIT must be an integer; the default is 0, meaning no limit.

validate-website - Web crawler for checking the validity of your documents.

  •    HTML

validate-website is a web crawler for checking markup validity against XML Schema / DTD and for finding not-found URLs (more info: doc/validate-website.adoc). validate-website-static checks the markup validity of your local documents against XML Schema / DTD (more info: doc/validate-website-static.adoc).

gopa - [WIP] GOPA, a spider written in Golang, for Elasticsearch

  •    Go

GOPA, a spider written in Go. First of all, get it; there are two options: download the pre-built package or compile it yourself.

awesome-web-scraper - A collection of awesome web scrapers and crawlers.

  •    

A collection of awesome web scrapers and crawlers. Please read the Contribution Guidelines before submitting your suggestion.