
huginn - Create agents that monitor and act on your behalf. Your agents are standing by!

  •    Ruby

Huginn is a system for building agents that perform automated tasks for you online. They can read the web, watch for events, and take actions on your behalf. Huginn's Agents create and consume events, propagating them along a directed graph. Think of it as a hackable version of IFTTT or Zapier on your own server. You always know who has your data. You do. Join us in our Gitter room to discuss the project.
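Huginn's core idea of agents that emit events and propagate them along a directed graph can be sketched in a few lines. This is an illustrative Python sketch of the concept, not Huginn's actual Ruby internals; the agent names and handler signature are hypothetical.

```python
# Agents consume an event, produce zero or more new events, and pass them
# downstream along directed edges -- the event-propagation model Huginn uses.

class Agent:
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler      # turns an incoming event into 0+ new events
        self.receivers = []         # downstream agents (directed edges)

    def emit(self, event):
        for receiver in self.receivers:
            for produced in receiver.handler(event):
                receiver.emit(produced)

# Example graph: a watcher feeds a filter that only passes "error" events
# on to a logger agent.
log = []
watcher = Agent("watcher", lambda e: [e])
filterer = Agent("filter", lambda e: [e] if "error" in e["text"] else [])
logger = Agent("logger", lambda e: (log.append(e), [])[1])

watcher.receivers.append(filterer)
filterer.receivers.append(logger)

watcher.emit({"text": "disk error on host-1"})  # reaches the logger
watcher.emit({"text": "all good"})              # dropped by the filter
```

In Huginn itself, agents run on a schedule, persist their events, and are wired together through the web UI rather than in code.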

autoscraper - A Smart, Automatic, Fast and Lightweight Web Scraper for Python

  •    Python

This project makes automatic web scraping easy. It takes a URL or the HTML content of a web page, along with a list of sample data we want to scrape from that page; the data can be text, a URL, or any HTML tag value. It learns the scraping rules and returns the similar elements. You can then use this learned object with new URLs to get similar content or the exact same elements from those new pages. It's compatible with Python 3.
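The "learn a rule from one sample, then reuse it" workflow can be sketched with the standard library alone. This is a minimal illustration of the idea, not autoscraper's real implementation: it learns which (tag, class) pair held the sample value, then applies that rule to find the similar elements.

```python
# Learn a scraping rule from one sample value, then reuse it to extract
# "similar elements" -- a toy version of autoscraper's approach.
from html.parser import HTMLParser

class RuleLearner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, class) of currently open elements
        self.texts = []        # ((tag, class), text) pairs seen in the page

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class", "")))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and data.strip():
            self.texts.append((self.stack[-1], data.strip()))

def learn_rule(html, sample):
    """Return the (tag, class) pair that contained the sample text."""
    p = RuleLearner(); p.feed(html)
    for rule, text in p.texts:
        if text == sample:
            return rule
    return None

def apply_rule(html, rule):
    """Return every text node matching a learned (tag, class) rule."""
    p = RuleLearner(); p.feed(html)
    return [text for r, text in p.texts if r == rule]

train = '<ul><li class="price">$10</li><li class="price">$20</li></ul>'
rule = learn_rule(train, "$10")      # learn from a single sample
print(apply_rule(train, rule))       # ['$10', '$20']
```

The real library builds more robust rules (and handles attributes and URLs, not just text), but the flow of learn-once, apply-to-new-pages is the same.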

Soup - Web Scraper in Go, similar to BeautifulSoup

  •    Go

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

morph - Take the hassle out of web scraping

  •    Ruby

Development is supported on Linux and Mac OS X. Just follow the instructions on the Docker site.

r-web-scraping-cheat-sheet - Guide, reference and cheatsheet on web scraping using rvest, httr and Rselenium

  •    R

Inspired by Hartley Brody, this cheat sheet is about web scraping using rvest, httr and RSelenium, and it covers many of the topics in his blog. While Hartley uses Python's requests and BeautifulSoup libraries, this cheat sheet covers the usage of httr and rvest. rvest is good enough for many scraping tasks, but httr is required for more advanced techniques. Usage of RSelenium (a web driver) is also covered.

webchem - Chemical Information from the Web

  •    R

webchem is an R package to retrieve chemical information from the web. The package interacts with a suite of web APIs. Functions that hit a specific API have a prefix and suffix separated by an underscore (prefix_suffix()), following the format source_functionality; e.g. cs_compinfo uses ChemSpider to retrieve compound information.

falkor - Open Source web scraping API. Falkor turns web pages into queryable JSON

  •    Clojure


pyparsing-webscraping-appcontrol-datawrangling - Slides and code for my talk: Using PyParsing For Web Scraping, Application Control and Data Wrangling

  •    Python

When scraping websites one must always observe the terms of service of that website. The spider that I provide in this repo is for educational purposes only.

webhog - Downloads and stores a given URL (including js, css, and images) for offline use.

  •    Javascript

webhog is a package that downloads and stores a given URL (including JS, CSS, and images) for offline use and uploads it to a given AWS S3 account (more persistence options to come). Usage: make a POST request to http://localhost:3000/scrape with the header X-API-KEY: SCRAPEAPI, passing a JSON body with the URL you'd like to fetch, e.g. { "url": "http://facebook.com" }. You'll notice an Ent dir: /blah/blah/blah printed to the console; your assets are saved there. To test, open the given index.html file.

robotstxt - robots.txt file parsing and checking for R

  •    R

Provides functions to download and parse ‘robots.txt’ files. Ultimately, the package makes it easy to check whether bots (spiders, crawlers, scrapers, …) are allowed to access specific resources on a domain.
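The same kind of check is available in Python's standard library, which makes the idea easy to demonstrate offline. This is an analogue of what the R package does, not the robotstxt package itself:

```python
# Parse a robots.txt body and ask whether a given agent may fetch a path --
# the core check the robotstxt R package provides.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page"))   # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
```

In practice you would download the file from https://example.com/robots.txt first; the parsing and permission check are the same.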

chesf - CHeSF is the Chrome Headless Scraping Framework, a very very alpha code to scrape javascript intensive web pages

  •    Python

In the era of Big Data, the web is an endless source of information. For this reason, there are plenty of good tools and frameworks for scraping web pages, so in an ideal world there should be no need for a new web scraping framework. Nevertheless, there are always subtle differences between theory and practice, and web scraping is no exception.

decryptr - An extensible API for breaking captchas

  •    R

decryptr is an R package to break captchas. It is also an extensible tool, built in a way that enables anyone to contribute their own captcha-breaking code. Simple, right? The decrypt() function is this package's workhorse: it takes a captcha (either the path to a captcha file or a captcha object read with read_captcha()) and breaks it with a model (either the name of a known model, the path to a model file, or a model object created with train_model()).

PacPaw - Pawn package manager for SA-MP

  •    Python

PacPaw is a Pawn package manager for SA-MP, written in Python and still under development. It relies mainly on web scraping with BeautifulSoup. In addition, it helps scripters gather snippets based on Pawn and the function references documented for SA-MP.

NBA_Predictions - Reworked NBA Predictions (in Python)

  •    Python

(Note) Please note that this was written around January 2015. These scripts rely heavily on web scraping to access the required data, and websites regularly change their layouts and locations. Because of this, the scraping may fail, which prevents updated predictions from being made. I will eventually get around to making sure this runs this fall. I've decided to rework my NBA prediction code from R to Python, mostly to see if I could do it, and also to see if I could speed it up a bit. I'll update here with current speed/accuracy results as the 2014-15 season plays out. The structure and format are pretty much the same, except that the code is cleaner. I still need to comment it a bit more, but it's Git ready for now.

SforSwagBot - A telegram chat bot for : Getting lyrics, Getting nearby restaurants and their menu and random quotes

  •    Python

A Telegram chat bot for getting lyrics, finding nearby restaurants and their menus, and fetching random quotes.

Rcrawler - An R web crawler and scraper

  •    R

Rcrawler is an R package for crawling websites and extracting structured data, which can be used for a wide range of applications such as web mining, text mining, web content mining, and web structure mining. So what is the difference between Rcrawler and rvest? rvest extracts data from one specific page by navigating through selectors, whereas Rcrawler automatically traverses and parses all the web pages of a website and extracts all the data you need from them at once with a single command: for example, collecting all published posts on a blog, extracting all products on a shopping website, or gathering comments and reviews for your opinion-mining studies. More than that, Rcrawler can help you study website structure by building a network representation of a site's internal and external hyperlinks (nodes and edges). Help us improve Rcrawler by asking questions, reporting issues, and suggesting new features. If you have a blog, write about it, or just share it with your colleagues.
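The "traverse all pages of a website" behaviour is, at heart, a breadth-first search over internal links. Here is a minimal Python sketch of that idea (not Rcrawler's R implementation); a fake in-memory site stands in for the HTTP fetches so the example is self-contained.

```python
# Breadth-first crawl over a site's internal link graph. In a real crawler,
# SITE.get(page) would be "fetch the page and extract its links".
from collections import deque

SITE = {  # page -> links found on that page (hypothetical site)
    "/":        ["/posts", "/about"],
    "/posts":   ["/posts/1", "/posts/2"],
    "/posts/1": ["/"],
    "/posts/2": ["/posts/1"],
    "/about":   [],
}

def crawl(start):
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)                 # data extraction would happen here
        for link in SITE.get(page, []):
            if link not in seen:           # never revisit a page
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))
```

Rcrawler layers parallel fetching, politeness settings, and per-page extraction on top of this traversal, but the visited-set-plus-queue core is the same.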

requestsR - R interface to Python requests module

  •    R

R has a number of great packages for interacting with web data, but it still lags behind Python, in large part because of the power and ease of use of the Requests module. This package aims to port those powers to R; I like to think of it as the Bo Jackson of web interaction tools.

instago - Download/access photos, videos, stories, story highlights, postlives, following and followers of Instagram

  •    Go

Get Instagram media (photos and videos), stories, story highlights, postlives (live streams shared to stories after they end), following and followers in Go. The following three values are required to access the Instagram API.

feedbridge - Plugin based RSS feed generator for sites that don't offer any

  •    Go

Feedbridge is a tool (hosted version / demo: feedbridge.notmyhostna.me) to provide RSS feeds for sites that don't have one, or only offer a feed of headlines. For each site (or kind of site) you want to generate a feed for, you'll have to implement a plugin with a custom scraping strategy. Feedbridge doesn't persist old items, so if something is no longer on the site you are scraping, it won't be in the feed; this is similar to how most feeds these days only carry the latest items. It publishes Atom, RSS 2.0, and JSON Feed Version 1 conformant feeds. There are a bunch of web apps doing something similar, some of which even let you drag and drop selectors to create a feed. That didn't work well for the site I was trying it on, so I decided to build this. (It was also fun to do.)
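The plugin idea, one small scraping function per site registered with the core, which turns whatever the plugin returns into feed entries, can be sketched briefly. This Python sketch is illustrative only; feedbridge is written in Go and its actual plugin interface differs.

```python
# A registry of per-site scraping plugins; the core builds a feed from
# whatever items a plugin returns. Names here are hypothetical.

PLUGINS = {}

def plugin(name):
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register

@plugin("example-news")
def scrape_example_news():
    # A real plugin would fetch and parse the site with its own strategy;
    # here we just return (title, url) items directly.
    return [("Headline A", "https://example.com/a"),
            ("Headline B", "https://example.com/b")]

def build_feed(name):
    items = PLUGINS[name]()
    entries = "".join(
        f"<item><title>{t}</title><link>{u}</link></item>" for t, u in items
    )
    return f"<rss version='2.0'><channel>{entries}</channel></rss>"

print(build_feed("example-news"))
```

Because the feed is rebuilt from the plugin's current output on every run, items that disappear from the site disappear from the feed, exactly the no-persistence behaviour described above.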

anirip - :clapper: A Crunchyroll show/season ripper

  •    Go

anirip is a Crunchyroll episode/subtitle ripper written in Go. It performs all actions associated with downloading video segments, subtitle files, and metadata, and muxes them together appropriately. 1) Install ffmpeg if it doesn't already exist on your system. We will be using this tool primarily for dumping episode content and editing video metadata.
