crawler - An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.

  •        35

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites; Chrome and Puppeteer power this feature.

https://murze.be/2015/11/building-a-crawler-in-php/
https://github.com/spatie/crawler
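
The idea of crawling multiple URLs concurrently can be illustrated with a minimal sketch. It is written in Python with a thread pool rather than PHP and Guzzle promises, so it is not the package's API; the names LinkExtractor, fetch and crawl are hypothetical, and the sketch does not execute JavaScript. It assumes a crawl restricted to the start URL's host and does no error handling for failed downloads.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default fetcher (an assumption): download a page and return its text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_workers=4, max_pages=50):
    """Breadth-first crawl of same-host links; each level is fetched concurrently."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Download every URL of the current level in parallel.
            pages = list(pool.map(fetch_page, frontier))
            next_frontier = []
            for url, html in zip(frontier, pages):
                parser = LinkExtractor()
                parser.feed(html)
                for href in parser.links:
                    # Resolve relative links and drop the fragment part.
                    link = urljoin(url, href).split("#")[0]
                    if urlparse(link).netloc != host or link in seen:
                        continue
                    if len(seen) >= max_pages:
                        break
                    seen.add(link)
                    next_frontier.append(link)
            frontier = next_frontier
    return seen
```

Passing fetch_page as a parameter keeps the sketch testable without network access: any callable that maps a URL to an HTML string will do.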

Related Projects

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ. Below is a very high level architecture diagram demonstrating how Crawler works.

webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence. It can simplify the development of a specific crawler.

gocrawl - Polite, slim and concurrent web crawler.

  •    Go

gocrawl is a polite, slim and concurrent web crawler written in Go. For a simpler yet more flexible web crawler written in a more idiomatic Go style, you may want to take a look at fetchbot, a package that builds on the experience of gocrawl.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make life easier for Enterprise Search integrators and developers. It is portable, extensible and reusable; it supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and a lot more.


yacy_grid_crawler - Crawler Microservice for the YaCy Grid

  •    Java

The Crawler is a microservice which can be deployed e.g. using Docker. When the Crawler component is started, it searches for an MCP and connects to it. By default the local host is searched for an MCP, but you can configure one yourself. Every loader and parser microservice must read this crawl profile information. Because that information is required many times, we omit a request into the crawler index by adding the crawler profile into each contract of a crawl job in the crawler_pending and loader_pending queue.

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It can currently detect thousands of bots/spiders/crawlers. Run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect" :"1.*" to your composer.json.
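
Detection via the user agent can be sketched as a pattern match against known bot markers. The following is a Python illustration, not CrawlerDetect's PHP API: the function is_crawler and the tiny BOT_PATTERNS list are hypothetical, and a list this short will miss many real crawlers.

```python
import re

# Hypothetical, deliberately tiny pattern list for illustration only.
BOT_PATTERNS = ["bot", "crawl", "spider", "slurp", "curl", "wget"]
BOT_REGEX = re.compile("|".join(BOT_PATTERNS), re.IGNORECASE)


def is_crawler(user_agent, http_from=None):
    """Return True if the User-Agent or From header matches a bot pattern."""
    # Missing headers are treated as empty strings, which never match.
    candidates = [user_agent or "", http_from or ""]
    return any(BOT_REGEX.search(value) for value in candidates)
```

Checking the From header as well catches clients that send a generic user agent but identify themselves through a contact address.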

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. StormCrawler is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward. Often, all you'll have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project and maybe write a couple of custom ones for your own secret sauce.

fscrawler - Elasticsearch File System Crawler (FS Crawler)

  •    Java

FS Crawler offers a simple way to index binary files into Elasticsearch.

commoncrawl-crawler - The CommonCrawl Crawler Engine and Related MapReduce code

  •    Java

The CommonCrawl Crawler Engine and Related MapReduce code

node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery ;-)

  •    Javascript

Web Crawler/Spider for NodeJS + server-side jQuery ;-)

Ex-Crawler

  •    Java

Ex-Crawler is divided into 3 subprojects (Crawler Daemon, distributed GUI client, (web) search engine) which together provide a flexible and powerful search engine supporting distributed computing. More information: http://ex-crawler.sourceforge.net

Squzer - Distributed Web Crawler

  •    Python

Squzer is Declum's open-source, extensible, scalable, multithreaded, high-quality web crawler project, written entirely in Python.

NCrawler

  •    DotNet

Simple and very efficient multithreaded web crawler with pipeline-based processing, written in C#. Contains HTML, Text, PDF, and IFilter document processors and language detection (Google). Easy to add pipeline steps to extract, use and alter information.

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Universal Information Crawler

  •    C

Universal Information Crawler is a fast, precise and reliable Internet crawler. Uicrawler is a program/automated script which browses the World Wide Web in a methodical, automated manner and creates an index of the documents that it accesses.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Arachnode.net

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

  •    Go

Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays. The package has a single external dependency, robotstxt. It also integrates code from the iq package.
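
Following robots.txt policies and crawl delays can be sketched with Python's standard urllib.robotparser module. This is not fetchbot's Go API; the helper names build_policy and plan_fetches, the user agent "examplebot" and the sample robots.txt content are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a policy object without network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(policy, user_agent, urls):
    """Return (url, wait_seconds) pairs for the URLs the policy allows."""
    # Fall back to no delay when robots.txt does not declare one.
    delay = policy.crawl_delay(user_agent) or 0
    allowed = [url for url in urls if policy.can_fetch(user_agent, url)]
    # The first request needs no wait; later ones wait for the crawl delay.
    return [(url, 0 if i == 0 else delay) for i, url in enumerate(allowed)]
```

A crawler loop would sleep for wait_seconds before each request, so disallowed paths are never fetched and allowed ones are spaced out as the site requests.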