
grab - Web Scraping Framework

  •    Python

Project Grab is not abandoned, but it is not being actively developed. At the moment I am working on another crawling framework, which I want to be simple, fast, and free of memory leaks. The new project is located here: https://github.com/lorien/crawler At first I tried a mix of asyncio (network) and classic threads (parsing HTML with lxml on multiple CPU cores), but I then decided to use classic threads for everything for the sake of simplicity. Network requests are processed with pycurl because it is fast, feature-rich, and supports SOCKS5 proxies. You can try the new framework, but be aware that it does not have many features yet. In particular, its options for configuring network requests are very poor. If you need an option it lacks, feel free to create a new issue.
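
A minimal sketch, in Python, of the kind of pycurl request through a SOCKS5 proxy that the new framework builds on; the URL and the proxy address are placeholders, not values from the project:

```python
# Fetch a page through a SOCKS5 proxy with pycurl; the proxy host/port
# and URL below are placeholders for illustration only.
from io import BytesIO

import pycurl

buffer = BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "https://example.com/")
curl.setopt(pycurl.FOLLOWLOCATION, True)
curl.setopt(pycurl.PROXY, "127.0.0.1")                  # placeholder proxy host
curl.setopt(pycurl.PROXYPORT, 9050)                     # placeholder proxy port
curl.setopt(pycurl.PROXYTYPE, pycurl.PROXYTYPE_SOCKS5)  # SOCKS5 support
curl.setopt(pycurl.WRITEDATA, buffer)
curl.perform()
status = curl.getinfo(pycurl.RESPONSE_CODE)
curl.close()

print(status, len(buffer.getvalue()), "bytes")
```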

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is a library and collection of components that developers can leverage to build their own crawlers, and doing so can be pretty straightforward: often, all you'll have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce.

Soup - Web Scraper in Go, similar to BeautifulSoup

  •    Go

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.
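
For readers who don't know the reference point, this is the BeautifulSoup workflow (in Python) that soup mirrors; per the description above, soup's Go API offers closely analogous fetch, parse, and find operations:

```python
# The BeautifulSoup idiom that soup imitates: fetch, parse, find, extract.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/").text
doc = BeautifulSoup(html, "html.parser")

title = doc.find("title").get_text()                 # first matching tag
links = [a.get("href") for a in doc.find_all("a")]   # all matching tags
print(title, links[:5])
```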

decapitated - Headless 'Chrome' Orchestration in R

  •    R

The ‘Chrome’ browser https://www.google.com/chrome/ has a headless mode which can be instrumented programmatically. Tools are provided to perform headless ‘Chrome’ instrumentation from the command line, including retrieving the JavaScript-executed web page, PDF output, or a screenshot of a URL. If the HEADLESS_CHROME environment variable is not set, the location of the Chrome binary is guessed (but not yet verified).
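
The command-line instrumentation the package wraps is language-neutral, so it can be sketched outside R; a minimal Python sketch, assuming the Chrome binary path is supplied via the HEADLESS_CHROME environment variable mentioned above:

```python
# Drive headless Chrome directly and capture the JavaScript-executed DOM.
# --print-to-pdf and --screenshot are the analogous flags for the package's
# other two outputs.
import os
import subprocess

chrome = os.environ.get("HEADLESS_CHROME", "google-chrome")  # fallback is a guess
url = "https://example.com/"  # placeholder

result = subprocess.run(
    [chrome, "--headless", "--disable-gpu", "--dump-dom", url],
    capture_output=True, text=True, check=True,
)
print(result.stdout[:200])
```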

splashr - :sweat_drops: Tools to Work with the 'Splash' JavaScript Rendering Service in R

  •    R

TL;DR: This package works with Splash rendering servers, which are really just a REST API & Lua scripting interface to a QT browser. It's an alternative to the Selenium ecosystem, which was really engineered for application testing & validation. Sometimes, all you need is a page scrape after JavaScript has been allowed to roam wild and free over meticulously crafted HTML tags. So, this package does not do everything Selenium can in pure R (though the Lua interface is equally powerful and accessible via R), but if you're just trying to get back a page that needs JavaScript rendering, this is a nice, lightweight, consistent alternative.
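
splashr itself is R, but the Splash server it wraps speaks plain HTTP, so the round trip is easy to sketch from any language; a minimal Python sketch against Splash's render.html endpoint, assuming an instance on the default port 8050:

```python
# Ask a local Splash server to render a page and return the HTML
# *after* JavaScript has run.
import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={
        "url": "https://example.com/",  # page to render (placeholder)
        "wait": 2,                      # seconds to let JavaScript settle
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.text[:200])  # the rendered DOM, not the raw source
```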

sqrape - Simple Query Scraping with CSS and Go Reflection

  •    Go

When scraping web content, one usually hopes that the content is laid out logically and that proper, or at least consistent, web annotation exists. This means well-nested HTML, appropriate use of tags, descriptive CSS classes, and unique CSS IDs. Ideally it also means that a given CSS selector will always yield a consistent datatype. Sqrape is built around that assumption: you declare CSS selectors for the data you want, and Go reflection fills in your struct. See examples/tweetgrab.go for a CLI tool built this way.
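
Sqrape does this with Go struct tags and reflection; as a rough illustration of the idea transposed to Python (every name and selector below is hypothetical, not Sqrape's API), a dataclass whose fields carry their own CSS selectors:

```python
# Declare which CSS selector feeds each field, then let generic code
# fill the structure: the CSS-selector-to-struct idea behind Sqrape.
from dataclasses import dataclass, field, fields

from bs4 import BeautifulSoup

@dataclass
class Tweet:
    author: str = field(metadata={"css": "span.author"})
    body: str = field(metadata={"css": "p.tweet-text"})

def scrape(cls, html: str):
    """Fill each field from the first node matching its CSS selector."""
    doc = BeautifulSoup(html, "html.parser")
    values = {}
    for f in fields(cls):
        node = doc.select_one(f.metadata["css"])
        values[f.name] = node.get_text(strip=True) if node else ""
    return cls(**values)

html = '<div><span class="author">@someone</span><p class="tweet-text">hello</p></div>'
print(scrape(Tweet, html))  # Tweet(author='@someone', body='hello')
```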

youtube_tutorials - Collection of scripts corresponding to LucidProgramming YouTube tutorials

  •    Python

LucidProgramming is my YouTube channel, and this repo is a collection of scripts corresponding to its tutorials. I would love to compile solutions to all of the problems here, as well as offer solutions in different languages; just create a pull request with your changes.

codepen-puppeteer - Use Puppeteer to download pens from Codepen.io as single html pages

  •    Javascript

Use Puppeteer to download pens from Codepen.io as single HTML pages. Need help or have a question? Post at Stack Overflow.

trump-lies - Tutorial: Web scraping in Python with Beautiful Soup

  •    Jupyter

This repository contains the Jupyter notebook and dataset from Data School's introductory web scraping tutorial. All that is required to follow along is a basic understanding of the Python programming language. By the end of the tutorial, you will be able to scrape data from a static web page using the requests and Beautiful Soup libraries, and export that data into a structured text file using the pandas library.
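
A miniature of the workflow the tutorial teaches: fetch a static page with requests, parse it with Beautiful Soup, and export the rows with pandas. The URL and selector are placeholders, not the ones used in the notebook:

```python
# Fetch, parse, extract, export: the tutorial's pipeline in four steps.
import pandas as pd
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/articles")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

records = [
    {"title": node.get_text(strip=True), "link": node.get("href")}
    for node in soup.select("a.article-title")       # placeholder selector
]

pd.DataFrame(records).to_csv("articles.csv", index=False)
```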

user-agents - A JavaScript library for generating random user agents with data that's updated daily.

  •    Javascript

User-Agents is a JavaScript package for generating random user agents based on how frequently they're used in the wild. A new version of the package is automatically released every day, so the data is always up to date. The generated data includes hard-to-find browser-fingerprint properties, and powerful filtering capabilities let you restrict the generated user agents to fit your exact needs. Web scraping often involves creating realistic traffic patterns, and doing so generally requires a good source of data. The User-Agents package provides a comprehensive dataset of real-world user agents and other browser properties which are commonly used for browser fingerprinting and blocking automated web browsers. Unlike other random user agent generation libraries, User-Agents is updated automatically on a daily basis, so you can use it without worrying about the data going stale in a matter of months.
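
The package itself is JavaScript; as a language-neutral sketch of its core idea, sampling user agents in proportion to how often they appear in real traffic, here it is in Python with made-up agents and weights:

```python
# Frequency-weighted user-agent sampling; the strings and weights below
# are invented for illustration, not the package's dataset.
import random

USER_AGENTS = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/120.0", 0.55),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ... Safari/605.1.15", 0.25),
    ("Mozilla/5.0 (X11; Linux x86_64) ... Firefox/121.0", 0.20),
]

agents, weights = zip(*USER_AGENTS)
print(random.choices(agents, weights=weights, k=1)[0])
```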

act-page-analyzer - Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema

  •    Javascript

When the analysis is finished, the act checks the INPUT parameters for any strings to search for; if there are any, it attempts to find them in all of the content found. The act ends when all output is parsed and searched. If the connection to the URL fails, or if any part of the act crashes, the act ends with an error in the output and the log.

apify-js - Apify SDK: The ultimate web scraping and automation library for JavaScript / Node

  •    Javascript

The package provides helper functions to launch web browsers with proxies, access the storage, etc. Note that use of the package is optional; you can create acts on the Apify platform without it. If you deploy your code to the Apify platform, you can set up a scheduler or execute your code via the web API.

ChiChew - :notebook_with_decorative_cover: 教育部《重編國語辭典修訂本》 網路爬蟲 :: A live web crawler for the Chinese-Chinese dictionary published by the Ministry of Education in Taiwan

  •    Python

A live web crawler (real-time lookup) for the 《重編國語辭典修訂本》 (Revised Mandarin Chinese Dictionary), a Chinese-Chinese dictionary published by the Ministry of Education in Taiwan.

bancocentralbrasil - :brazil: Official daily rates for inflation, Selic, Poupança (savings), US dollar, and euro from the Banco Central do Brasil website

  •    Python

:brazil: Information on the official daily rates for inflation, Selic, Poupança (savings), US dollar, and euro, scraped from the Banco Central do Brasil website.

Humanoid - Node.js package to bypass CloudFlare's anti-bot JavaScript challenges

  •    Javascript

Humanoid is a Node.js package to solve and bypass CloudFlare (and hopefully, in the future, other WAFs') JavaScript anti-bot challenges. While anti-bot pages are solvable via headless browsers, those are pretty heavy and usually considered over the top for scraping. Humanoid can solve these challenges using the Node.js runtime and present the protected HTML page. The session cookies can also be delegated to other bots to continue scraping, letting them avoid the JS challenges altogether.
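
Humanoid is Node.js, but the cookie-delegation idea is language-neutral: once one client has solved the challenge, any HTTP client can reuse its session cookies. A sketch in Python; the cookie value and URL are placeholders, and cf_clearance is Cloudflare's clearance cookie:

```python
# Reuse challenge cookies solved elsewhere so this client skips the JS
# challenge. Cloudflare's clearance cookie is typically tied to the user
# agent that solved it, so the same UA string must be sent.
import requests

solved_cookies = {"cf_clearance": "value-obtained-from-the-solver"}  # placeholder
headers = {"User-Agent": "the exact UA string that solved the challenge"}

resp = requests.get(
    "https://protected.example.com/",  # placeholder URL
    cookies=solved_cookies,
    headers=headers,
)
print(resp.status_code)
```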

gopa - [WIP] GOPA, a spider written in Golang, for Elasticsearch

  •    Go

GOPA, a spider written in Go. First of all, get it; there are two options: download the pre-built package or compile it yourself.