cloudflare-scrape - A Python module to bypass Cloudflare's anti-bot page.

  •    Python

A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests. Cloudflare changes its techniques periodically, so I will update this repo frequently. This can be useful if you wish to scrape or crawl a website protected by Cloudflare. Cloudflare's anti-bot page currently just checks whether the client supports JavaScript, though additional techniques may be added in the future.

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for uses such as UI testing, machine learning and analytics. With its own declarative language, ferret abstracts away the technical details and complexity of the underlying technologies, letting you focus on the data itself. It's extremely portable, extensible and fast. The following example demonstrates the use of dynamic pages. First, we load the main Google Search page, type the search criteria into an input box and click the search button. The click triggers a redirect, so we wait until it completes. Once the page has loaded, we iterate over all elements in the search results and assign the output to a variable. The final for loop filters out empty elements that may result from imprecise selectors.

listal - Download bot for listal.com

  •    Javascript

You need Node.js installed to use or develop this. To install it as a command-line tool, install it globally with npm.

scrapman - Retrieve real (with JavaScript executed) HTML code from a URL, ultra fast, with support for loading multiple pages in parallel

  •    Javascript

Scrapman is a blazingly fast real (with JavaScript executed) HTML scraper, built from the ground up to support parallel fetches. With it you can get the HTML code for 50+ URLs in about 30 seconds. In Node.js you can easily use request to fetch the HTML from a page, but what if the page you are trying to load is not a static HTML page and has dynamic content added with JavaScript? What do you do then? Well, you use The Scrapman.




big-data-upf - RECSM-UPF Summer School: Social Media and Big Data Research

  •    HTML

Citizens across the globe spend an increasing proportion of their daily lives online. Their activities leave behind granular, time-stamped footprints of human behavior and personal interactions that represent a new and exciting source of data to study standing questions about political and social behavior. At the same time, the volume and heterogeneity of web data present unprecedented methodological challenges. The goal of this course is to introduce participants to new computational methods and tools required to explore and analyze Big Data from online sources using the R programming language. We will focus in particular on data collected from social networking sites, such as Facebook and Twitter, whose use is becoming widespread in the social sciences. There are two ways you can follow the course and run the code contained in this GitHub repository. The recommended method is to connect to the provided RStudio server where all the R packages have already been installed, and all the R code is available. To access the server, visit bigdata.pablobarbera.com and log in with the information provided during class.

dataflowkit - Extract structured data from web sites. Website scraping.

  •    Go

Dataflow kit ("DFK") is a web scraping framework for Gophers. It extracts data from web pages by following the specified CSS selectors. You can use it in many ways, for example for data mining, data processing or archiving.
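
The idea of selector-driven extraction can be sketched in a few lines. The sketch below is in Python using only the standard library and is not DFK's API: the `SelectorExtractor` class and the `extract` helper are hypothetical names, and only selectors of the form `tag.class` are supported, which is a small subset of real CSS selectors.

```python
from html.parser import HTMLParser


class SelectorExtractor(HTMLParser):
    """Collects the text of elements matching simple `tag.class` selectors.

    Illustrative only: a matched element must not contain a nested
    element with the same tag name, and each selector needs a class.
    """

    def __init__(self, fields):
        super().__init__()
        # fields maps an output name to a selector such as "h2.title"
        self.fields = {name: tuple(sel.split(".", 1)) for name, sel in fields.items()}
        self.results = {name: [] for name in fields}
        self._open = []  # stack of (field name, tag, text parts) being captured

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name, (want_tag, want_class) in self.fields.items():
            if tag == want_tag and want_class in classes:
                self._open.append((name, tag, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently being captured.
        for _, _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][1] == tag:
            name, _, parts = self._open.pop()
            self.results[name].append("".join(parts).strip())


def extract(html, fields):
    """Return {field name: [text, ...]} for each selector in `fields`."""
    parser = SelectorExtractor(fields)
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling `extract(page_html, {"titles": "h2.title", "prices": "span.price"})` returns a dictionary with one list of strings per field, which can then be passed to `json.dumps` or written to a CSV file.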

fulldom-server - Proxy-like server that will show you the DOM of a page after JS runs

  •    Javascript

This tells fulldom to bind to port 1337 on localhost only. The configuration keys are the same as the long-form CLI options (e.g. --port on the CLI corresponds to port in JSON).
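
The correspondence between long-form CLI options and JSON keys can be illustrated with a small Python sketch. This is not fulldom's code (fulldom is a JavaScript project): the `build_parser` and `load_config` helpers, the `host` key and the default values are assumptions made for illustration. Only the `port` key and the value 1337 come from the description above.

```python
import argparse
import json


def build_parser():
    """Long-form options of a hypothetical proxy-like server."""
    parser = argparse.ArgumentParser(description="hypothetical server")
    parser.add_argument("--port", type=int, default=8080)  # default is an assumption
    parser.add_argument("--host", default="0.0.0.0")       # hypothetical option
    return parser


def load_config(json_text, argv=()):
    """Merge a JSON config with CLI arguments.

    Each JSON key must match the name of a long-form option
    (e.g. "port" for --port). Values given in argv override the file.
    """
    parser = build_parser()
    known = vars(parser.parse_args([]))  # option names and built-in defaults
    config = json.loads(json_text)
    unknown = set(config) - set(known)
    if unknown:
        raise ValueError(f"unknown configuration keys: {sorted(unknown)}")
    parser.set_defaults(**config)        # JSON values become the new defaults
    return vars(parser.parse_args(list(argv)))
```

With this layout a single option table drives both the command line and the configuration file, so a key that is valid in one place is valid in the other.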


ApkTrack - ApkTrack is an Android app which checks if updates for installed APKs are available.

  •    Java

ApkTrack checks whether updates for installed apps are available. It was created for users who do not want to use the Google Play Store but would still like to be informed when new versions of their installed applications are available. ApkTrack performs simple website scraping to obtain the latest version information for the APKs present on a device. It can query F-Droid, the Play Store, Xposed and many other sources of APKs via the ApkTrack Proxy.

crawly - Crawly, a high-level web crawling & scraping framework for Elixir.

  •    Elixir

Crawly is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing or historical archival. In this section we will show how to bootstrap a small project and set up Crawly for proper data extraction.

Multi-Go - A multi-tool made in Go, and aimed at security experts to make life a little more convenient

  •    Go

A command-line multi-tool made in Go, aimed at security experts, to make life a little more convenient. It does this by combining a large array of different tasks into one program. Multi Go is intended to be used on Linux (mostly Debian- and Ubuntu-like distros). It might run on Windows, but that is currently neither tested nor supported. I will eventually work on a Windows patch.

scraply - Scraply is a simple DOM scraper that fetches information from any HTML-based website and converts it to JSON APIs

  •    Go

Scraply is a simple DOM scraper that fetches information from any HTML-based website and converts it to JSON APIs.