dataflowkit - Extract structured data from websites. Website scraping.

Dataflow kit ("DFK") is a Web Scraping framework for Gophers. It extracts data from web pages, following the specified CSS Selectors. You can use it in many ways for data mining, data processing or archiving.

https://dataflowkit.com
https://github.com/slotix/dataflowkit
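
As a rough illustration of the selector-driven approach, the sketch below submits a scraping job to a locally running dataflowkit parse service from Go. The endpoint, port, and payload schema here are assumptions made for illustration only; consult the dataflowkit documentation for the actual request format.

    package main

    import (
        "bytes"
        "fmt"
        "io/ioutil"
        "net/http"
    )

    func main() {
        // Hypothetical job definition: a URL to fetch plus a list of
        // CSS-selector-driven fields, per the description above. The
        // field names and JSON schema are illustrative assumptions.
        payload := []byte(`{
            "request": {"url": "https://example.com"},
            "fields": [
                {"name": "title", "selector": "h1", "extractor": {"types": ["text"]}}
            ],
            "format": "json"
        }`)

        // Assumes a dataflowkit parse service is listening locally.
        resp, err := http.Post("http://localhost:8001/parse", "application/json", bytes.NewReader(payload))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(body)) // extracted data as JSON
    }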

Related Projects

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for tasks such as UI testing, machine learning, and analytics. Having its own declarative language, ferret abstracts away the technical details and complexity of the underlying technologies, helping you focus on the data itself. It's extremely portable, extensible, and fast. The following example demonstrates working with dynamic pages. First, we load the main Google Search page, type search criteria into an input box, and click the search button. The click action triggers a redirect, so we wait for it to complete. Once the page has loaded, we iterate over all elements in the search results and assign the output to a variable. The final for loop filters out empty elements that may result from imprecise selectors.
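
The example itself did not survive on this page. Below is a minimal sketch of what it could look like, embedding an FQL query through ferret's Go compiler package. The Google selectors ('input[name="q"]', 'input[name="btnK"]', '.g'), the boolean dynamic-page flag to DOCUMENT, and the exact spellings of the FQL built-ins are assumptions reconstructed from the description; check ferret's documentation for the current syntax.

    package main

    import (
        "context"
        "fmt"

        "github.com/MontFerret/ferret/pkg/compiler"
    )

    // The FQL query mirrors the steps described above: load the Google
    // Search page, type a query, click the search button, wait for the
    // redirect, collect result texts, and filter out empty ones.
    const query = `
    LET google = DOCUMENT("https://www.google.com/", true)

    INPUT(google, 'input[name="q"]', "ferret")
    CLICK(google, 'input[name="btnK"]')

    WAIT_NAVIGATION(google)

    LET results = (
        FOR el IN ELEMENTS(google, '.g')
            RETURN TRIM(el.innerText)
    )

    FOR text IN results
        FILTER text != ""
        RETURN text
    `

    func main() {
        program, err := compiler.New().Compile(query)
        if err != nil {
            panic(err)
        }
        out, err := program.Run(context.Background())
        if err != nil {
            panic(err)
        }
        fmt.Println(string(out)) // JSON array of non-empty result texts
    }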

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when websites are built on modern frontend frameworks such as AngularJS, React, and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.

colly - Elegant Scraper and Crawler Framework for Golang

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
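
For a flavor of that interface, here is a minimal sketch along the lines of colly's basic usage pattern (the target URL is arbitrary):

    package main

    import (
        "fmt"

        "github.com/gocolly/colly"
    )

    func main() {
        // Create a collector, the central object that manages requests.
        c := colly.NewCollector()

        // Run a callback for every element matching the CSS selector
        // and print the link target of each anchor.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            fmt.Println(e.Attr("href"))
        })

        // Start scraping from the entry page.
        c.Visit("http://go-colly.org/")
    }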

Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Goutte - A simple PHP web scraper

  •    PHP

Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

awesome-puppeteer - A curated list of awesome puppeteer resources.

A curated list of awesome puppeteer resources for controlling headless Chrome (or Chromium) over the DevTools Protocol. Contributions welcome! Please read the contributing guidelines first.

serverless-chrome - 🌐 Run headless Chrome/Chromium on AWS Lambda (maybe Azure, & GCP later)

  •    Javascript

Serverless Chrome contains everything you need to get started running headless Chrome on AWS Lambda (possibly Azure and GCP Functions soon). Why? Because it's neat. It also opens up interesting possibilities for using the Chrome DevTools Protocol (and tools like Chromeless or Puppeteer) in serverless architectures and for testing/CI, web scraping, pre-rendering, etc.

panther - A browser testing and web crawling library for PHP and Symfony

  •    PHP

Panther is a convenient standalone library to scrape websites and run end-to-end tests using real browsers. Panther is super powerful: it leverages the W3C WebDriver protocol to drive native web browsers such as Google Chrome and Firefox.

select.rs - A Rust library to extract useful data from HTML documents, suitable for web scraping.

  •    Rust

A library to extract useful data from HTML documents, suitable for web scraping. Note: the entire API is currently unstable and will change as I use this library more in real-world projects. If you have any suggestions or feedback, please open an issue or send me an email.

Revenant - A high level PhantomJS headless browser in Node.js ideal for task automation

  •    Javascript

A headless browser powered by PhantomJS functions in Node.js, based on the PhantomJS-Node bridge. This library aims to abstract many of the simple functions one would use while testing or scraping a web page. Instead of running page.evaluate(...) and writing the JavaScript for each task, these tasks are abstracted for the user.

dryscrape - [not actively maintained] A lightweight Python library that uses Webkit to enable easy scraping of dynamic, Javascript-heavy web pages

  •    Python

NOTE: This package is not actively maintained. It uses QtWebKit, which is end-of-life and probably doesn't get security fixes backported. Consider using a similar package like Spynner instead. dryscrape is a lightweight web scraping library for Python. It uses a headless WebKit instance to evaluate JavaScript on the visited pages. This enables painless scraping of plain web pages as well as JavaScript-heavy “Web 2.0” applications like Facebook.

scraper - Simple web scraping for Google Chrome.

  •    Javascript

Simple web scraping for Google Chrome.

node-scraper - Easier web scraping using node.js and jQuery

  •    Javascript

A little module that makes scraping websites a little easier. Uses node.js and jQuery. The first argument is a URL as a string; the second is a callback that exposes a jQuery object, with your scraped site as "body"; and the third is an object from the request containing info about the URL.

cyborg - Python web scraping framework

  •    Python

Cyborg is an asyncio Python 3 web scraping framework that helps you write programs to extract information from websites by reading and inspecting their HTML.

scraperjs - A complete and versatile web scraper.

  •    Javascript

Scraperjs is a web scraper module that makes scraping the web an easy job. Try to spot the differences.

portia - Visual scraping for Scrapy

  •    Python

Portia is a tool that allows you to visually scrape websites without any programming knowledge. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will use these annotations to work out how to scrape data from similar pages. For more detailed instructions, and for alternatives to using Docker, see the Installation docs.

facebook-page-post-scraper - Data scraper for Facebook Pages, and also code accompanying the blog post How to Scrape Data From Facebook Page Posts for Statistical Analysis

  •    Python

UPDATE December 2017: Due to a bug on Facebook's end, using this scraper will only return a very small subset of posts (5-10% of posts) over a limited timeframe. Since Facebook now owns CrowdTangle, the (paid) canonical source of historical Facebook data, Facebook doesn't have an incentive to fix the linked bug. On December 12th, a Facebook engineer commented that they are developing a new endpoint for scraping posts chronologically. I will refactor this script once that happens. Until then, there likely will not be any PRs accepted.

web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension

  •    Javascript

Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) for how a website should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all data. Scraped data can later be exported as CSV. When submitting a bug, please attach an exported sitemap if possible.

Lulu - [Unmaintained] A simple and clean video/music/image downloader 👾

  •    Python

Sorry for this. Lulu is a friendly you-get fork (⏬ Dumb downloader that scrapes the web).