headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •        104

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
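To illustrate the "empty bodies" problem, here is a minimal sketch of how a request-based crawler might notice that a page has no visible content until its JavaScript runs. The helper name `visibleBodyText` is hypothetical and is not part of headless-chrome-crawler; it uses a rough regex heuristic rather than a real HTML parser.

```javascript
// Hypothetical helper (not part of headless-chrome-crawler): estimate the
// text a request-based crawler would see without executing any JavaScript.
// This is a regex heuristic for illustration, not a real HTML parser.
function visibleBodyText(html) {
  // Use the contents of <body> if present, otherwise the whole document.
  const match = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  const body = match ? match[1] : html;
  return body
    .replace(/<script[\s\S]*?<\/script>/gi, '') // drop inline and external scripts
    .replace(/<style[\s\S]*?<\/style>/gi, '')   // drop stylesheets
    .replace(/<[^>]+>/g, ' ')                   // strip the remaining tags
    .replace(/\s+/g, ' ')                       // collapse whitespace
    .trim();
}

// A typical single-page-app shell: only a mount point and a script tag.
const spaShell =
  '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';
// A server-rendered page with real content in the markup.
const staticPage = '<html><body><h1>Hello</h1><p>World</p></body></html>';

console.log(JSON.stringify(visibleBodyText(spaShell)));   // ""
console.log(JSON.stringify(visibleBodyText(staticPage))); // "Hello World"
```

An empty result for the first page is the case where rendering in a real browser, as this crawler does through Puppeteer, becomes necessary.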

https://github.com/yujiosaka/headless-chrome-crawler#readme

Dependencies:

debug : 3.1.0
jquery : 3.3.1
lodash : 4.17.5
puppeteer : 1.5.0
request : 2.87.0
request-promise : 4.2.2
robots-parser : 1.0.2

Related Projects

serverless-chrome - 🌐 Run headless Chrome/Chromium on AWS Lambda (maybe Azure, & GCP later)

  •    Javascript

Serverless Chrome contains everything you need to get started running headless Chrome on AWS Lambda (possibly Azure and GCP Functions soon). Why? Because it's neat. It also opens up interesting possibilities for using the Chrome DevTools Protocol (and tools like Chromeless or Puppeteer) in serverless architectures and doing testing/CI, web-scraping, pre-rendering, etc.

puppeteer - Headless Chrome Node API

  •    Javascript

Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium. Note: When you install Puppeteer, it downloads a recent version of Chromium (~170Mb Mac, ~282Mb Linux, ~280Mb Win) that is guaranteed to work with the API. To skip the download, see Environment variables.

awesome-puppeteer - A curated list of awesome puppeteer resources.

A curated list of awesome puppeteer resources for controlling headless Chrome (or Chromium) over the DevTools Protocol. Contributions welcome! Please read the contributing guideline first.

pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)

  •    Python

Unofficial Python port of the puppeteer JavaScript (headless) Chrome/Chromium browser automation library. Note: When you run pyppeteer for the first time, it downloads a recent version of Chromium (~100MB). If you prefer to avoid this behavior, run the pyppeteer-install command before running scripts that use pyppeteer.

rendora - dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites

  •    Go

Rendora can be seen as a reverse HTTP proxy server sitting between your backend server (e.g. Node.js/Express.js, Python/Django, etc.) and potentially your frontend proxy server (e.g. nginx, traefik, apache, etc.), or even directly facing the outside world. It does nothing but transport requests and responses as they are, except when it detects whitelisted requests according to the config. In that case, Rendora instructs a headless Chrome instance to request and render the corresponding page and then returns the server-side rendered page to the client (i.e. the frontend proxy server or the outside world). This simple functionality makes Rendora a powerful dynamic renderer without changing anything in either frontend or backend code. Dynamic rendering means that the server provides server-side rendered HTML to web crawlers such as GoogleBot and BingBot while providing the typical initial HTML to normal users, to be rendered on the client side. Dynamic rendering is meant to improve SEO for websites written in modern JavaScript frameworks like React, Vue, Angular, etc.
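The routing decision described above can be sketched as follows. This is an illustration only and not Rendora's actual code or configuration format: the user-agent whitelist, `isWhitelisted`, `handleRequest`, and the two injected callbacks are all hypothetical names, and matching on the user agent is an assumed whitelist criterion.

```javascript
// Hypothetical sketch of a dynamic-rendering decision (not Rendora's code).
// Assumption: whitelisting is done by matching the User-Agent header.
const CRAWLER_PATTERNS = [/googlebot/i, /bingbot/i];

function isWhitelisted(userAgent) {
  return typeof userAgent === 'string' &&
    CRAWLER_PATTERNS.some((pattern) => pattern.test(userAgent));
}

// renderWithHeadlessChrome and passThrough are injected so that the routing
// logic can be exercised without a real browser or backend.
async function handleRequest(request, { renderWithHeadlessChrome, passThrough }) {
  if (isWhitelisted(request.headers['user-agent'])) {
    // Crawlers receive the server-side rendered page.
    return renderWithHeadlessChrome(request.url);
  }
  // Everyone else receives the untouched backend response.
  return passThrough(request);
}
```

Keeping the renderer and the pass-through behind injected functions mirrors the idea that neither frontend nor backend code has to change: only the proxy decides which path a request takes.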


html-pdf-chrome - HTML to PDF converter via Chrome/Chromium

  •    TypeScript

HTML to PDF converter via Chrome/Chromium. Note: It is strongly recommended that you keep Chrome running side-by-side with Node.js. There is significant overhead starting up Chrome for each PDF generation which can be easily avoided.

headless-devtools - Lets you perform Chrome DevTools actions from code by leveraging Headless Chrome+Puppeteer

  •    Javascript

Lets you perform Chrome DevTools actions from code by leveraging Headless Chrome+Puppeteer. Chrome DevTools is great for getting valuable information about your app 🕵️‍♂️. Using headless-devtools you can automate this process 🤖. One use-case is to collect this data over time 📈, which can help you keep your app in good health 👩‍⚕️.

cuprite - Headless Chrome driver for Capybara

  •    Ruby

Cuprite is a pure Ruby driver (read as no Java/Selenium/WebDriver/ChromeDriver requirement) for Capybara. It allows you to run your Capybara tests on headless Chrome or Chromium via the CDP protocol. Under the hood it uses Ferrum, which is a high-level API to the browser, again via the CDP protocol. The emphasis is on the raw CDP protocol because Headless Chrome allows you to do many things that are barely supported by WebDriver, since WebDriver has to keep a consistent design across browsers. The design of the driver will be as close to Poltergeist as possible, though that is not a goal.

pdf-bot - 🤖 A Node queue API for generating PDFs using headless Chrome

  •    Javascript

Easily create a microservice for generating PDFs using headless Chrome. pdf-bot is installed on a server and will receive URLs to turn into PDFs through its API or CLI. pdf-bot will manage a queue of PDF jobs. Once a PDF job has run it will notify you using a webhook so you can fetch the API. pdf-bot supports storing PDFs on S3 out of the box. Failed PDF generations and Webhook pings will be retried after a configurable decaying schedule.
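As a rough illustration of retrying "after a configurable decaying schedule", the sketch below computes exponentially growing delays with a cap. It is an assumption about what such a schedule could look like, not pdf-bot's implementation: `retryDelays`, `nextRetryAt`, and the option names are hypothetical.

```javascript
// Hypothetical decaying retry schedule (exponential backoff with a cap).
// pdf-bot's real strategy and option names may differ.
function retryDelays({ baseMs = 1000, factor = 2, maxRetries = 5, maxDelayMs = 60000 } = {}) {
  const delays = [];
  for (let retry = 0; retry < maxRetries; retry += 1) {
    // Each retry waits `factor` times longer than the previous one, up to the cap.
    delays.push(Math.min(baseMs * factor ** retry, maxDelayMs));
  }
  return delays;
}

// Returns the timestamp (ms) of the next retry, or null once retries are exhausted.
function nextRetryAt(failedAttempts, lastTriedAtMs, options) {
  const delays = retryDelays(options);
  if (failedAttempts < 1 || failedAttempts > delays.length) {
    return null;
  }
  return lastTriedAtMs + delays[failedAttempts - 1];
}

console.log(retryDelays()); // [ 1000, 2000, 4000, 8000, 16000 ]
```

A failed PDF job or webhook ping would be rescheduled at `nextRetryAt(...)` and dropped once that function returns `null`.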

taiko - A node.js library to automate chrome/chromium browser

  •    Javascript

Taiko is a free and open source browser automation tool built by the team behind Gauge from ThoughtWorks. Taiko is a node library with a clear and concise API to automate the chrome browser. Tests written in Taiko are highly readable and maintainable. Taiko’s smart selectors make tests reliable by adapting to changes in the structure of your web application. With Taiko there’s no need for id/css/xpath selectors or adding explicit waits (for XHR requests) in test scripts.

chrome-headless-screenshots - Using headless Chrome as an automated screenshot tool (alternative to PhantomJS)

  •    Javascript

May 2018: The Chrome team has released Puppeteer, which makes much of the code in this repository obsolete. As such, this project is no longer maintained and issues have been disabled. Please check out Puppeteer. This repo contains an example implementation of using headless Chrome as an automated screenshot tool on Linux, which is a common use case for PhantomJS. Contributions are welcome.

sms-boom - An SMS bomber that uses Chrome's headless mode to simulate user registration

  •    Javascript

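Turns on Chrome's headless mode and simulates a user registering... Anyone can be a contributor. If you find a website that can serve as an SMS provider, please raise it in an issue or open a PR.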

mochify.js - ☕️ TDD with Browserify, Mocha, Headless Chrome and WebDriver

  •    Javascript

Browserifies ./test/*.js, decorated with a Mocha test runner, runs it in Headless Chrome and passes the output back to your console. Cleans up your stack traces by mapping back to the original sources and removing lines from the test framework. For proxy settings and other environment variables, see the Puppeteer documentation.

chrome-headless-browser-docker - Continuously building Chrome Docker image for Linux.

  •    Shell

This repository contains three docker builds. This docker image contains the Linux Dev channel Chromium (https://www.chromium.org/getting-involved/dev-channel), with the required dependencies and the command line argument running headless mode provided.

chrome-har-capturer - Capture HAR files from a headless Chrome instance

  •    Javascript

Capture HAR files from a headless Chrome instance. Under the hood this module uses chrome-remote-interface to instrument Chrome.

rendertron - A dockerized, headless Chrome rendering solution

  •    Javascript

Rendertron is a dockerized, headless Chrome rendering solution designed to render & serialise web pages on the fly. Rendertron is designed to enable your Progressive Web App (PWA) to serve the correct content to any bot that doesn't render or execute JavaScript. Rendertron runs as a standalone HTTP server. Rendertron renders requested pages using Headless Chrome, auto-detecting when your PWA has completed loading and serializes the response back to the original request. To use Rendertron, your application configures middleware to determine whether to proxy a request to Rendertron. Rendertron is compatible with all client side technologies, including web components.

Revenant - A high level PhantomJS headless browser in Node.js ideal for task automation

  •    Javascript

A headless browser, powered by PhantomJS, that runs in Node.js. Based on the PhantomJS-Node bridge. This library aims to abstract many of the simple functions one would use while testing or scraping a web page. Instead of running page.evaluate(...) and writing the JavaScript functions for a task yourself, these tasks are abstracted for the user.

navalia - A bullet-proof, fast, and reliable headless browser API

  •    TypeScript

The bullet-proof, fast, and most feature-rich Chrome driver around. Navalia lets you interact with Chrome and run parallel work with ease. Not using JavaScript? There's a GraphQL server that you can use to communicate with over HTTP allowing any runtime to drive Chrome. Simply run navalia with a specified port e.g.

chromeless - 🖥 Chrome automation made simple. Runs locally or headless on AWS Lambda.

  •    TypeScript

You can try out Chromeless and explore the API in the browser-based demo playground (source). With Chromeless you can control Chrome (open website, click elements, fill out forms...) using an elegant API. This is useful for integration tests or any other scenario where you'd need to script a real browser.
