colly - Fast and Elegant Scraping Framework for Gophers

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
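
A minimal sketch of that interface, using colly's collector callbacks (the URL is illustrative):

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// A Collector is the central piece of a colly crawler.
	c := colly.NewCollector()

	// Register a callback that fires for every matched HTML element.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("link:", e.Attr("href"))
	})

	// Start crawling at the given URL (illustrative).
	c.Visit("https://example.com/")
}
```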

newspaper - 💡 News, full-text, and article metadata extraction in Python 3.

  •    Python

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto-detect one. Check out the documentation for full and detailed guides on using newspaper.
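
A minimal sketch of the Article workflow; the URL is illustrative, and language auto-detection applies when no language argument is passed:

```python
from newspaper import Article

# URL is illustrative; the language is auto-detected when not specified.
article = Article('https://example.com/some-news-story')
article.download()  # fetch the raw HTML
article.parse()     # extract title, text, authors, etc.

print(article.title)
print(article.text)
```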

annie - 👾 Fast, simple and clean video downloader

  •    Go

👾 Annie is a fast, simple and clean video downloader built with Go. Certain dependencies are required and must be installed separately.

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
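
A minimal sketch of the launch/queue pattern from the project's README (the page function and URL are illustrative):

```javascript
const HCCrawler = require('headless-chrome-crawler');

HCCrawler.launch({
  // Runs inside the page via Puppeteer, so JS-rendered content is visible.
  evaluatePage: () => ({ title: document.title }),
  onSuccess: (result) => console.log(result.result.title),
}).then(async (crawler) => {
  await crawler.queue('https://example.com/'); // URL is illustrative
  await crawler.onIdle(); // resolves once the queue is empty
  await crawler.close();
});
```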

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for tasks such as UI testing, machine learning and analytics. Having its own declarative language, ferret abstracts away the technical details and complexity of the underlying technologies, helping you focus on the data itself. It's extremely portable, extensible and fast. The following example demonstrates working with dynamic pages. First, we load the main Google Search page, type the search criteria into an input box, and click the search button. The click triggers a redirect, so we wait until it ends. Once the page has loaded, we iterate over all elements in the search results and assign the output to a variable. The final FOR loop filters out empty elements that can result from imprecise selectors.
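
A sketch of what such an FQL script might look like, built from ferret's documented FQL functions (DOCUMENT, INPUT, CLICK, WAIT_NAVIGATION, ELEMENTS); the selectors are illustrative, not copied from the project's README:

```fql
// Load the page with the dynamic (headless browser) driver enabled.
LET google = DOCUMENT("https://www.google.com/", true)

// Type the search criteria and click the search button (selectors illustrative).
INPUT(google, 'input[name="q"]', "ferret")
CLICK(google, 'input[name="btnK"]')

// The click triggers a redirect, so wait until it ends.
WAIT_NAVIGATION(google)

// Iterate over the search results, filtering out empty elements.
FOR el IN ELEMENTS(google, '.g')
    FILTER TRIM(el.innerText) != ""
    RETURN TRIM(el.innerText)
```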

jd-autobuy - A Python crawler for JD.com: automatic login and online flash purchasing

  •    Python

The code is for learning purposes only. JD.com's pages change constantly, so the code is not guaranteed to always run correctly. If you find a bug, pull requests are welcome.

Lulu - [Unmaintained] A simple and clean video/music/image downloader 👾

  •    Python

Sorry for this. Lulu is a friendly fork of you-get (⏬ a dumb downloader that scrapes the web).

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ. The project's README includes a very high-level architecture diagram demonstrating how Crawler works.
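
A minimal sketch of kicking off a crawl; the option names follow the project's documented options, and the values are illustrative:

```elixir
# workers and interval feed the OPQ-based worker pooling and rate limiting.
Crawler.crawl("https://example.com",
  max_depths: 2, # how many links deep to follow
  workers: 5,    # size of the concurrent worker pool
  interval: 500  # pause (in ms) between requests
)
```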

Revenant - A high level PhantomJS headless browser in Node.js ideal for task automation

  •    Javascript

A headless browser powered by PhantomJS functions in Node.js, based on the PhantomJS-Node bridge. This library aims to abstract many of the simple functions one would use while testing or scraping a web page. Instead of running page.evaluate(...) and writing the JavaScript for each task by hand, these tasks are abstracted for the user.
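
For contrast, a sketch of the raw PhantomJS-Node style that Revenant abstracts away (this is the bridge's API, not Revenant's own):

```javascript
const phantom = require('phantom');

(async () => {
  const instance = await phantom.create();
  const page = await instance.createPage();
  await page.open('https://example.com/'); // URL is illustrative

  // The manual step Revenant wraps: run a function inside the page.
  const title = await page.evaluate(function () {
    return document.title;
  });

  console.log(title);
  await instance.exit();
})();
```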

recrawler - Remote web content crawler done right.

  •    Javascript

Sometimes I want to grab some nice images from a URL like http://bbs.005.tv/thread-492392-1-1.html, so I made this little program that combines node-fetch and cheerio to fulfill that wish.
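
A sketch of that underlying technique, written directly against node-fetch and cheerio (not recrawler's own API):

```javascript
const fetch = require('node-fetch');
const cheerio = require('cheerio');

// Fetch the page, load the HTML into cheerio, and collect image URLs.
fetch('http://bbs.005.tv/thread-492392-1-1.html')
  .then((res) => res.text())
  .then((html) => {
    const $ = cheerio.load(html);
    $('img').each((_, el) => console.log($(el).attr('src')));
  });
```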

PainlessCrawler - (WIP) A painless Node.js web crawler that simply works

  •    Javascript

Working with some of the top Node.js crawlers on GitHub has led to much frustration in getting something that simply works; I ended up spending many hours playing around and figuring out how they were supposed to work. As such, I wanted to build a first and foremost painless crawler that just works, with clear documentation, allowing the user to focus on more important tasks.

freshonions-torscraper - Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi

  •    Python

This is a copy of the source for the http://zlal32teyptf4tvi.onion hidden service, which implements a Tor hidden-service crawler/spider and website. This software is made available under the GNU Affero GPL 3 license. This means that if you deploy this software as part of networked software that is available to the public, you must make the source code (including any modifications) available.

sandcrawler - sandcrawler.js - the server-side scraping companion.

  •    Javascript

sandcrawler.js is a Node.js library that aims to provide developers with concise but exhaustive tools for scraping the web. Disclaimer: this library is an unreleased work in progress.

node-tarantula - web crawler/spider for nodejs

  •    Javascript

A Node.js crawler/spider that provides a simple interface for crawling the web. Its API was inspired by crawler4j.

robotstxt - robots.txt file parsing and checking for R

  •    R

Provides functions to download and parse ‘robots.txt’ files. Ultimately the package makes it easy to check if bots (spiders, crawlers, scrapers, …) are allowed to access specific resources on a domain.
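
A minimal sketch using the package's exported helpers (the domain and path are illustrative):

```r
library(robotstxt)

# Download and parse a domain's robots.txt (domain is illustrative).
rt <- robotstxt(domain = "example.com")

# Check whether a given bot may access a given path.
paths_allowed(paths = "/some/path", domain = "example.com", bot = "*")
```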