cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server

  •    Javascript

❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.
ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

colly - Fast and Elegant Scraping Framework for Gophers

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

x-ray - The next web scraper. See through the <html> noise.

  •    Javascript

Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
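
The idea of a schema that is independent of the page layout can be sketched in a few lines. This is not x-ray's API: it is a minimal illustration using only Python's standard library (`xml.etree.ElementTree`) on a well-formed snippet. The `extract` helper, the `::attr` suffix, the schema layout, and the sample markup are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; ElementTree needs well-formed markup.
PAGE = """
<html><body>
  <div class="post"><h2>First</h2><a href="/a">read</a><span class="tag">go</span><span class="tag">web</span></div>
  <div class="post"><h2>Second</h2><a href="/b">read</a></div>
</body></html>
""".strip()


def extract(node, schema):
    """Fill `schema` from `node`. Schema forms (illustrative only):

    "path"          -> text of the first match, or None
    "path::attr"    -> attribute of the first match, or None
    ["path"]        -> list with the text of every match
    ("path", {...}) -> list of objects, one per match
    {...}           -> nested object
    """
    if isinstance(schema, dict):
        return {key: extract(node, sub) for key, sub in schema.items()}
    if isinstance(schema, list):
        return [el.text for el in node.findall(schema[0])]
    if isinstance(schema, tuple):
        path, sub = schema
        return [extract(el, sub) for el in node.findall(path)]
    path, _, attr = schema.partition("::")
    el = node.find(path)
    if el is None:
        return None
    return el.get(attr) if attr else el.text


# The output shape is chosen by the caller, not dictated by the page.
schema = {
    "first_title": ".//h2",
    "posts": (
        ".//div[@class='post']",
        {"title": "h2", "link": "a::href", "tags": ["span[@class='tag']"]},
    ),
}

result = extract(ET.fromstring(PAGE), schema)
print(result)
```

The same page could be flattened into a list of titles or nested more deeply just by changing `schema`; the extraction code stays the same.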

newspaper - 💡 News, full-text, and article metadata extraction in Python 3.

  •    Python

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto detect a language. Check out The Documentation for full and detailed guides using newspaper.




noodle - A node server and module which allows for cross-domain page scraping on web documents with JSONP or POST

  •    Javascript

The noodle tests create a temporary server on port 8889, and the automated tests tell noodle to query it. Contributors and suggestions are welcome.

scrape-it - 🔮 A Node.js scraper for humans.

  •    Javascript

A Node.js scraper for humans. Please post questions on Stack Overflow. You can open issues with questions, as long as you add a link to your Stack Overflow question.

scraperjs - A complete and versatile web scraper.

  •    Javascript

Scraperjs is a web scraper module that makes scraping the web an easy job. Try to spot the differences.

huginn - Create agents that monitor and act on your behalf. Your agents are standing by!

  •    Ruby

Huginn is a system for building agents that perform automated tasks for you online. They can read the web, watch for events, and take actions on your behalf. Huginn's Agents create and consume events, propagating them along a directed graph. Think of it as a hackable version of IFTTT or Zapier on your own server. You always know who has your data. You do. Join us in our Gitter room to discuss the project.
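
The event model described above, where agents create and consume events that flow along a directed graph, can be sketched as follows. This is not Huginn's code or API (Huginn is written in Ruby). The `Agent` class, the `propagate` function, and the three example agents are hypothetical names chosen for illustration, and the sketch assumes the graph has no cycles.

```python
from collections import deque


class Agent:
    """A node in the graph: consumes one event, emits zero or more events."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler   # callable: event dict -> list of event dicts
        self.receivers = []      # outgoing edges of the directed graph

    def link_to(self, other):
        self.receivers.append(other)
        return other             # returned so links can be chained


def propagate(source, event):
    """Deliver `event` to `source`, then forward every emitted event
    to the emitting agent's receivers, breadth-first.
    Assumes an acyclic graph; a cycle would loop forever."""
    log = []
    queue = deque([(source, event)])
    while queue:
        agent, incoming = queue.popleft()
        for emitted in agent.handler(incoming):
            log.append((agent.name, emitted))
            for receiver in agent.receivers:
                queue.append((receiver, emitted))
    return log


# Hypothetical pipeline: watch a price, keep only cheap ones, build a message.
watcher = Agent("watcher", lambda ev: [{"price": ev["price"]}])
cheap = Agent("cheap_filter", lambda ev: [ev] if ev["price"] < 100 else [])
notifier = Agent("notifier", lambda ev: [{"message": f"price dropped to {ev['price']}"}])
watcher.link_to(cheap).link_to(notifier)

log = propagate(watcher, {"price": 80})     # reaches the notifier
quiet = propagate(watcher, {"price": 120})  # stopped by the filter
print(log)
print(quiet)
```

An agent that emits nothing, like the filter in the second call, simply ends that branch of the graph.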


annie - 👾 Fast, simple and clean video downloader

  •    Go

👾 Annie is a fast, simple and clean video downloader built with Go. The following dependencies are required and must be installed separately.

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
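
The "empty body" problem can be shown without a browser. The sketch below does not use headless-chrome-crawler or Puppeteer. It uses only Python's standard `html.parser` to extract the visible text a request-based crawler would see, and both sample pages are hypothetical. In the second page the content would only appear after the script runs, which a plain HTTP fetch never does.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect visible text inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Server-rendered page: the content is already in the HTML.
STATIC_PAGE = "<html><body><h1>Prices</h1><p>Widget: $5</p></body></html>"
# Single-page-app shell: the content is injected later by JavaScript.
SPA_SHELL = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'

static_text = visible_text(STATIC_PAGE)
spa_text = visible_text(SPA_SHELL)
print(repr(static_text))
print(repr(spa_text))
```

A crawler driven by a real browser engine executes the script first, so it sees the rendered content instead of the empty shell.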

colly - Elegant Scraper and Crawler Framework for Golang

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

scanless - online port scan scraper

  •    Python

Command-line utility for using websites that can perform port scans on your behalf.

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for things like UI testing, machine learning and analytics. Having its own declarative language, ferret abstracts away technical details and complexity of the underlying technologies, helping to focus on the data itself. It's extremely portable, extensible and fast. The following example demonstrates the use of dynamic pages. First of all, we load the main Google Search page, type search criteria into an input box and then click a search button. The click action triggers a redirect, so we wait until it completes. Once the page has loaded, we iterate over all elements in the search results and assign the output to a variable. The final for loop filters out empty elements that might result from imprecise selectors.

node-read - Get Readable Content from any page

  •    Javascript

Get Clean Reading Content from every web page

node-ytdl-core - Youtube downloader in javascript.

  •    HTML

Yet another YouTube downloading module. Written in pure JavaScript, with a Node-friendly streaming interface. For a CLI version of this, check out ytdl and pully.

Lulu - [Unmaintained] A simple and clean video/music/image downloader 👾

  •    Python

Sorry for this. Lulu is a friendly you-get fork (⏬ Dumb downloader that scrapes the web).

scala-scraper - A Scala library for scraping content from HTML pages

  •    Scala

A library providing a DSL for loading and extracting content from HTML pages. Take a look at Examples.scala and at the unit specs for usage examples or keep reading for more thorough documentation. Feel free to use GitHub Issues for submitting any bug or feature request and Gitter to ask questions.

ImageScraper - ✂️ High performance, multi-threaded image scraper

  •    Python

A high performance, easy to use, multithreaded command line tool which downloads images from the given webpage. Note that ImageScraper depends on lxml, requests, setproctitle, and future. If you run into problems in the compilation of lxml through pip, install the libxml2-dev and libxslt-dev packages on your system.

Media Companion

  •    

Media Companion allows users to catalogue their Movie and TV show collections with a fine degree of control, including posters, fanart and episode screenshots.