
colly - Fast and Elegant Scraping Framework for Gophers

  •    Go

Colly provides a clean interface to write any kind of crawler, scraper, or spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, such as data mining, data processing, or archiving.
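
A minimal sketch of that interface, using colly's documented callback API (the seed URL is a placeholder):

    package main

    import (
        "fmt"

        "github.com/gocolly/colly"
    )

    func main() {
        // Create a collector; options such as allowed domains go here.
        c := colly.NewCollector()

        // Fire a callback for every link found on visited pages.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            fmt.Println(e.Attr("href"))
        })

        // Start crawling from a seed URL (placeholder).
        c.Visit("https://example.com/")
    }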

newspaper - 💡 News, full-text, and article metadata extraction in Python 3.

  •    Python

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto-detect one. Check out the documentation for full and detailed guides to using newspaper.
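
A short sketch of the documented Article workflow (the URL is a placeholder; omit the language argument to let Newspaper auto-detect it):

    from newspaper import Article

    article = Article("https://example.com/some-article", language="en")

    article.download()  # fetch the HTML
    article.parse()     # extract title, authors, text, ...

    print(article.title)
    print(article.authors)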

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React, or Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
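
A minimal sketch along the lines of the project's README (the URL is a placeholder; evaluatePage runs inside the headless browser, so dynamically rendered content is visible):

    const HCCrawler = require('headless-chrome-crawler');

    (async () => {
      const crawler = await HCCrawler.launch({
        // Executed in the page context after rendering.
        evaluatePage: () => ({ title: document.title }),
        onSuccess: (result) => console.log(result.result.title),
      });
      crawler.queue('https://example.com/'); // placeholder URL
      await crawler.onIdle();
      await crawler.close();
    })();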

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for things like UI testing, machine learning, and analytics. With its own declarative language, ferret abstracts away the technical details and complexity of the underlying technologies, helping you focus on the data itself. It is extremely portable, extensible, and fast. The following example demonstrates working with dynamic pages: first, we load the main Google Search page, type the search criteria into an input box, and click the search button. The click triggers a redirect, so we wait until it completes. Once the page has loaded, we iterate over all elements in the search results and assign the output to a variable. The final FOR loop filters out empty elements that may result from inaccurate selectors.
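
The example itself is not reproduced in this listing; the following FQL sketch reconstructs it along the lines of ferret's Google Search demo (selectors and function names are approximate):

    LET google = DOCUMENT("https://www.google.com/", { driver: "cdp" })

    // Type the search criteria and click the search button.
    INPUT(google, 'input[name="q"]', "ferret")
    CLICK(google, 'input[name="btnK"]')

    // The click triggers a redirect, so wait until navigation ends.
    WAIT_NAVIGATION(google)

    // Iterate over the results, filtering out empty elements that
    // inaccurate selectors may produce.
    FOR result IN ELEMENTS(google, '.g')
        FILTER TRIM(result.innerText) != ""
        RETURN result.innerText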

Lulu - [Unmaintained] A simple and clean video/music/image downloader 👾

  •    Python

Lulu is a friendly fork of you-get (⏬ a dumb downloader that scrapes the web).

awesome-puppeteer - A curated list of awesome puppeteer resources.

  •    

A curated list of awesome Puppeteer resources for controlling headless Chrome (or Chromium) over the DevTools Protocol. Contributions welcome! Please read the contributing guidelines first.

Crawlme - Ajax crawling for your web application

  •    Javascript

Crawlme is a Connect/Express middleware that makes your Node.js web application indexable by search engines. Crawlme generates static HTML snapshots of your JavaScript web application on the fly and has a built-in, periodically refreshing in-memory cache, so even though snapshot generation may take a second or two, search engines will get them very fast. This is beneficial for SEO, since response time is one of the factors used in the page ranking algorithm.

Making Ajax applications crawlable has always been tricky, since search engines don't execute JavaScript on the websites they crawl. The solution is to provide search engines with pre-rendered HTML versions of each page on your site, but creating those HTML versions has until now been a tedious and error-prone process with many manual steps. Crawlme fixes this by rendering HTML snapshots of your web application on the fly whenever the Googlebot crawls your site. Apart from making the process of more or less manually creating indexable HTML versions of your site obsolete, this also means that Google will always index the latest version of your site and not some old pre-rendered version.
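
As an illustration of the mechanism (this is not Crawlme's actual API; the middleware and renderer below are hypothetical), an Express app serving snapshots to crawlers could look like:

    const express = require('express');

    // Hypothetical stand-in for Crawlme's on-the-fly, cached rendering.
    async function renderSnapshot(path) {
      return '<html><body>Pre-rendered snapshot of ' + path + '</body></html>';
    }

    const app = express();

    // Serve a static HTML snapshot when a crawler asks for
    // ?_escaped_fragment_= (the classic Ajax-crawling scheme).
    app.use(async (req, res, next) => {
      if (req.query._escaped_fragment_ === undefined) return next();
      res.send(await renderSnapshot(req.path));
    });

    app.use(express.static(__dirname + '/webapp'));
    app.listen(3000);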


spider2 - A 2nd-generation spider that crawls any article site and automatically extracts the title and content.

  •    Javascript

A 2nd-generation spider to crawl any article site, automatically extracting the title and content. In my case, the spider processes about 700 thousand documents per day (22 million per month); the maximum crawling speed is 450 pages per minute (about 80 per minute on average); memory cost is about 200 megabytes per spider kernel; and accuracy is about 90%, with the remaining 10% fixable by customizing Score Rules or Selectors. It performs better than any other readability module.

twilight - Twitter Streaming API — tools and data transformations

  •    Javascript

Tools for accessing the Twitter API v1.1 with paranoid timeouts and de-pagination. Tab- or space-separated values are fine, and any other columns will simply be ignored, e.g., if you want to record the screen_name of each account. Order doesn't matter either -- your headers just have to line up with their values.
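
For example, an accounts file might look like the following (whitespace-separated; the OAuth column names are an assumption here, and the extra screen_name column would simply be ignored):

    screen_name     consumer_key   consumer_secret   access_token   access_token_secret
    alice_example   XXXXXXXX       XXXXXXXX          XXXXXXXX       XXXXXXXX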

diffbot-php-client - The official Diffbot client library

  •    PHP

This package is a slightly overengineered Diffbot API wrapper. It uses PSR-7 and PHP-HTTP friendly client implementations to make API calls. To learn more about Diffbot, see here and their homepage. Right now it only supports the Analyze, Product, Image, Discussion, Crawl, Search, and Article APIs, but it can also accommodate Custom APIs. Video and Bulk API support is coming soon. Full documentation is available here.
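
A rough usage sketch (the token and URL are placeholders, and the method names should be checked against the library's docs):

    <?php
    require 'vendor/autoload.php';

    use Swader\Diffbot\Diffbot;

    // Token and URL are placeholders.
    $diffbot = new Diffbot('my_token');

    // Build and execute an Article API call.
    $article = $diffbot
        ->createArticleAPI('https://example.com/some-article')
        ->call();

    echo $article->getTitle();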

node-goldwasher - Extraction of text and related metadata.

  •    Javascript

Alternatively, the output can be configured as XML, Atom, or RSS with the output option. Redundant information, such as the source, is included because each returned nugget is supposed to be an atomic piece of information: each nugget should carry the fact that "somewhere, at some point in time, something was written (with a link to some place)".
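
Conceptually, each nugget is a self-contained record; a hypothetical JSON rendering (field names invented for illustration) might be:

    {
      "text": "Something was written",
      "href": "https://example.com/some-place",
      "source": "https://example.com/",
      "timestamp": 1464343200
    }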

node-krawler - Fast and lightweight web crawler with built-in cheerio, xml and json parser.

  •    Javascript

mikeal/request is used for fetching web pages, so any desired option from that package can be passed to Krawler's constructor. After Krawler emits the 'data' event, it automatically continues to the next URL, regardless of whether the result was processed. If you would like full control over result handling, you can turn on the custom callback option and control the program flow by invoking your callback. Don't forget to call it in every case, otherwise the queue will get stuck.
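
A sketch of that flow under an assumed event-style API (names are approximate; the point is that the callback must be invoked in every case):

    var Krawler = require('krawler'); // module name assumed

    function saveSomewhere(title, done) {
      console.log(title); // stand-in for real persistence
      done(null);
    }

    // Constructor options are forwarded to mikeal/request;
    // customCallback hands flow control to us (names approximate).
    var krawler = new Krawler({ customCallback: true });

    krawler
      .on('data', function ($, url, response, callback) {
        saveSomewhere($('title').text(), function (err) {
          callback(err); // always call it, or the queue gets stuck
        });
      })
      .on('error', function (err, url) { console.error(url, err); })
      .queue('https://example.com/'); // placeholder URL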

pomp - Screen scraping and web crawling framework

  •    Python

Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the hard Twisted dependency. If you want proxies, redirects, or similar features, you may use the excellent requests library as the Pomp downloader.
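
A minimal crawler in the spirit of Pomp's docs (class and method names as best recalled; treat them as approximate):

    import re

    from pomp.core.base import BaseCrawler
    from pomp.contrib.urllibtools import UrllibDownloader
    from pomp.core.engine import Pomp

    class TitleCrawler(BaseCrawler):
        ENTRY_REQUESTS = 'https://example.com/'  # placeholder seed URL

        def extract_items(self, response):
            # Yield every <title> found in the fetched page body.
            for title in re.findall(r'<title>(.*?)</title>',
                                    response.body.decode('utf-8')):
                yield title

    if __name__ == '__main__':
        pomp = Pomp(downloader=UrllibDownloader())
        pomp.pump(TitleCrawler())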

js-crawler - Web crawler for Node.JS

  •    TypeScript

Web crawler for Node.js; both HTTP and HTTPS are supported. The call to configure is optional; if it is omitted, the default option values are used.
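
A short example following the README (the depth value and URL are placeholders):

    var Crawler = require('js-crawler');

    // configure is optional; defaults apply when it is omitted.
    var crawler = new Crawler().configure({ depth: 3 });

    crawler.crawl('https://example.com/', function onSuccess(page) {
      console.log(page.url);
    });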

corpuscrawler - Crawler for linguistic corpora

  •    Python

Corpus Crawler is a tool for corpus linguistics. Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language, removes boilerplate and HTML markup, and finally writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so that it does not place much load on the crawled websites.

XML-Parser - A Node.js XML DOM, Parser & Stringifier.

  •    Javascript

Parse XML, HTML, and more with a very tolerant XML parser and convert it into a DOM. The three components (DOM, parser, and stringifier) are separated into modules of their own.