Displaying 1 to 17 from 17 results

colly - Fast and Elegant Scraping Framework for Gophers

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
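
As a rough illustration of what "extracting structured data" from a page means, here is a minimal sketch in Python using only the standard library. It is not Colly's API (Colly is a Go library); the class name `LinkExtractor`, the helper `extract_links`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) records
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page content; a real crawler would fetch this over HTTP.
SAMPLE_HTML = (
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b"> Second </a></li></ul>'
)


def extract_links(html):
    """Turn raw HTML into a list of (href, text) tuples."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

The resulting list of tuples is the kind of structured record that can then be fed into data mining, processing, or archiving steps.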

newspaper - 💡 News, full-text, and article metadata extraction in Python 3.

  •    Python

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto-detect one. See the project documentation for full, detailed guides to using newspaper.

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    JavaScript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
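
The "empty body" problem can be sketched without a browser. The Python example below (standard library only) extracts the visible text from two hypothetical HTML responses: a server-rendered page and a single-page-app shell whose content would only appear after JavaScript runs. The names `VisibleText`, `visible_text`, `STATIC_PAGE`, and `SPA_SHELL` are assumptions made for illustration and are not part of headless-chrome-crawler.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that is not inside <script>, <style>, or <title>."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a request-only crawler would see in the response."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical responses: one rendered on the server, one rendered by JS.
STATIC_PAGE = "<html><body><h1>Hello</h1><p>world</p></body></html>"
SPA_SHELL = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

if __name__ == "__main__":
    print(repr(visible_text(STATIC_PAGE)))
    print(repr(visible_text(SPA_SHELL)))
```

The second response contains no text at all until a real browser executes the script, which is the gap a headless-browser crawler is meant to close.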

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for things such as UI testing, machine learning and analytics. Having its own declarative language, ferret abstracts away the technical details and complexity of the underlying technologies, helping you focus on the data itself. It's extremely portable, extensible and fast. The project's dynamic-page example works as follows. First, it loads the main Google Search page, types search criteria into an input box and clicks the search button. The click triggers a redirect, so it waits for the navigation to finish. Once the page has loaded, it iterates over all elements in the search results and assigns the output to a variable. The final for loop filters out empty elements that may result from imprecise selectors.




colly - Elegant Scraper and Crawler Framework for Golang

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Lulu - [Unmaintained] A simple and clean video/music/image downloader 👾

  •    Python

Sorry for this. Lulu is a friendly fork of you-get (⏬ Dumb downloader that scrapes the web).

node-krawler - Fast and lightweight web crawler with built-in cheerio, xml and json parser.

  •    JavaScript

mikeal/request is used for fetching web pages, so any option from that package can be passed to Krawler's constructor. After Krawler emits the 'data' event, it automatically continues to the next URL. It does not care whether the result was processed or not. If you would like full control over result handling, you can turn on the custom callback option. Then you control the program flow by invoking your callback. Don't forget to call it in every case, otherwise the queue will get stuck.
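
The warning about always invoking the callback can be sketched with a toy queue. This is a Python illustration of the general pattern, not Krawler's JavaScript API; `CallbackQueue`, `careful_handler`, `forgetful_handler`, `run`, and the URLs are all hypothetical.

```python
from collections import deque


class CallbackQueue:
    """Processes URLs one at a time; the handler must call done() to advance."""

    def __init__(self, urls, handler):
        self._pending = deque(urls)
        self._handler = handler
        self.processed = []  # URLs whose handler called done()

    def start(self):
        self._next()

    def _next(self):
        if not self._pending:
            return
        url = self._pending.popleft()

        def done():
            self.processed.append(url)
            self._next()  # only now does the queue move on

        self._handler(url, done)


def careful_handler(url, done):
    """Calls done() on every path, including the failure path."""
    try:
        if "bad" in url:
            raise ValueError("simulated processing failure")
    except ValueError:
        pass  # a real handler would log or record the error here
    finally:
        done()


def forgetful_handler(url, done):
    """Returns early on failure without calling done(): the queue stalls."""
    if "bad" in url:
        return
    done()


URLS = ["http://example.com/1", "http://example.com/bad", "http://example.com/3"]


def run(handler):
    queue = CallbackQueue(URLS, handler)
    queue.start()
    return queue.processed


if __name__ == "__main__":
    print(run(careful_handler))    # all three URLs are handled
    print(run(forgetful_handler))  # stops after the first URL
```

With the forgetful handler, the third URL is never reached even though nothing crashed, which is what a stuck queue looks like in practice.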

pomp - Screen scraping and web crawling framework

  •    Python

Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the hard Twisted dependency. If you want proxies, redirects, or similar, you may use the excellent requests library as the Pomp downloader.


js-crawler - Web crawler for Node.js

  •    TypeScript

Web crawler for Node.js; both HTTP and HTTPS are supported. The call to configure is optional; if it is omitted, the default option values are used.

XML-Parser - A Node.js XML DOM, Parser & Stringifier.

  •    JavaScript

Parse XML, HTML and more with a very tolerant XML parser and convert it into a DOM. The three components (DOM, parser and stringifier) are shipped separately as their own modules.
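
The parse, DOM, and stringify round trip can be sketched with Python's standard `xml.dom.minidom`. This is an analogy only: it is not the XML-Parser package, and unlike the tolerant parser described here, `minidom` rejects malformed input. The function `mark_entries` and the sample document are hypothetical.

```python
from xml.dom import minidom

# Hypothetical input document.
SOURCE = "<feed><entry id='1'>First</entry><entry id='2'>Second</entry></feed>"


def mark_entries(xml_text):
    """Parse text into a DOM, edit the tree, and serialize it back to text."""
    dom = minidom.parseString(xml_text)             # parser: text -> DOM
    for entry in dom.getElementsByTagName("entry"):
        entry.setAttribute("seen", "true")          # DOM: edit nodes in place
    return dom.documentElement.toxml()              # stringifier: DOM -> text


if __name__ == "__main__":
    print(mark_entries(SOURCE))
```

Keeping the three stages separate means the tree can be inspected or modified between parsing and serializing.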

gopa - [WIP] GOPA, a spider written in Golang, for Elasticsearch

  •    Go

GOPA, a spider written in Go. To get it, you have two options: download the pre-built package or compile it yourself.

proxifier - A fast, modern and intelligent proxy rotator perfect for crawling and scraping public data

  •    Go

A fast, modern and intelligent proxy rotator perfect for crawling and scraping public data. Proxifier acts as a proxy, relaying requests and responses to and from other remote proxies. To get started, download and install proxifier.
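
What a proxy rotator does can be sketched in a few lines of Python. This is a minimal round-robin rotation that skips proxies marked as failed; it is an assumption about the general technique, not proxifier's actual selection logic. `ProxyRotator` and the addresses are hypothetical.

```python
class ProxyRotator:
    """Hands out proxies in round-robin order, skipping ones marked failed."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._failed = set()
        self._index = 0  # position of the next candidate

    def mark_failed(self, proxy):
        """Exclude a proxy from future rotation."""
        self._failed.add(proxy)

    def next_proxy(self):
        """Return the next healthy proxy, or raise if none are left."""
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._index]
            self._index = (self._index + 1) % len(self._proxies)
            if proxy not in self._failed:
                return proxy
        raise RuntimeError("no healthy proxies left")


if __name__ == "__main__":
    # Hypothetical upstream proxy addresses.
    rotator = ProxyRotator(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
    for _ in range(4):
        print(rotator.next_proxy())
```

A crawler would call `next_proxy()` before each request and `mark_failed()` when an upstream proxy times out or gets blocked.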

easy-scraping-tutorial - Simple but useful Python web scraping tutorial code.

  •    Jupyter

In these tutorials, we will learn to build some simple but useful scrapers from scratch. You will see how to read a web page, select the sections you need, and even download files. If you understand Chinese, you are lucky! I made Chinese video and text tutorials for all of this content. You can find them at 莫烦Python. For learning from the code, I made two options for you.

antch - Antch, a fast, powerful and extensible web crawling & scraping framework for Go

  •    Go

Antch is inspired by Scrapy. If you're familiar with Scrapy, you can get started quickly. Antch is a fast, powerful and extensible web crawling & scraping framework for Go, used to crawl websites and extract structured data from their pages.

go-scrapy - Web crawling and scraping framework for Golang

  •    Go

go-scrapy is a very useful and productive web crawling framework, used to crawl websites and extract structured data from parsed pages. Please go through the examples to get an idea of how to use this package.