x-ray - The next web scraper. See through the <html> noise.

Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
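As a rough sketch of what such a schema can look like, the snippet below combines an array of objects with a nested object. It assumes the x-ray package is installed (npm install x-ray) and uses its callback form with a raw HTML string as the source. The sample HTML, the selectors, and the names postSchema and scrapePosts are made up for illustration.

```javascript
// Sketch only: the markup, selectors, and field names are hypothetical.
const sampleHtml = `
  <div class="post">
    <h2>First post</h2>
    <a href="/posts/1">Read more</a>
    <span class="author">Ada</span>
  </div>
  <div class="post">
    <h2>Second post</h2>
    <a href="/posts/2">Read more</a>
    <span class="author">Grace</span>
  </div>
`;

// An array of objects: one result object per element matched by the scope.
// The shape (including the nested "meta" object) is chosen by the caller,
// not dictated by how the page nests its elements.
const postSchema = [{
  title: 'h2',          // text content of the <h2>
  link: 'a@href',       // "@attr" selects an attribute instead of text
  meta: {
    author: '.author'   // nested object inside each result
  }
}];

// Runs the schema against an HTML string, scoped to each ".post" element.
function scrapePosts(html, callback) {
  // Required lazily so the schema above can be inspected without the package.
  const Xray = require('x-ray');
  const x = Xray();
  x(html, '.post', postSchema)(callback);
}

// Example call (requires x-ray to be installed):
// scrapePosts(sampleHtml, (err, posts) => {
//   if (err) throw err;
//   console.log(JSON.stringify(posts, null, 2));
// });
```

The same schema could be pointed at a URL instead of an HTML string; only the first argument changes.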

https://github.com/lapwinglabs/x-ray#readme

Dependencies:

batch : ~0.5.2
bluebird : ^3.4.7
chalk : ~1.1.1
cheerio : ~0.20.0
debug : ~2.2.0
enstore : ~1.0.1
is-url : ~1.2.0
isobject : ~2.0.0
object-assign : ~4.0.1
stream-to-string : ^1.1.0
x-ray-crawler : ~2.0.1
x-ray-parse : ~1.0.1

Related Projects

wring - Extract content from webpages using CSS Selectors, XPath, and JS expressions

  •    PureScript

Wring utilizes PhantomJS for some of its commands. To use these, install it using your system package manager by running something like brew install phantomjs on OS X, or apt-get install phantomjs on Ubuntu. You can make sure it's on your PATH by running phantomjs -v.

cheerio - Fast, flexible, and lean implementation of core jQuery designed specifically for the server

  •    Javascript

❤ Familiar syntax: Cheerio implements a subset of core jQuery. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API.

ϟ Blazingly fast: Cheerio works with a very simple, consistent DOM model. As a result parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that cheerio is about 8x faster than JSDOM.

MatthewMueller-cheerio

  •    CoffeeScript

Fast, flexible, and lean implementation of core jQuery designed specifically for the server.

scrape-it - A Node.js scraper for humans.

  •    Javascript

A Node.js scraper for humans. Please post questions on Stack Overflow. You can open issues with questions, as long as you add a link to your Stack Overflow question.


app-store-scraper - Scrape data from the iTunes App Store

  •    Javascript

Node.js module to scrape application data from the iTunes/Mac App Store. The goal is to provide an interface as close as possible to the google-play-scraper module.

node-google - A Node.js module to search and scrape Google.

  •    Javascript

This module allows you to search Google by scraping the results. It does NOT use the Google Search API. PLEASE DO NOT ABUSE THIS. The intent of using this is convenience vs the cruft that exists in the Google Search API. This is not sponsored, supported, or affiliated with Google Inc.

WebExtractor360 - Open Source Web Extractor

WebExtractor360 is a free and open source web data extractor. It uses Regular Expressions to find, extract and scrape internet data quickly and easily. It is very flexible, allowing you to extract both simple and commonly used data and complex data structures like HTML tables.
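To give a feel for regex-based extraction of a table, here is a minimal sketch in JavaScript. It is not WebExtractor360's code, and the function name extractTableRows is hypothetical. Regular expressions like these only cope with simple, regular markup; nested tables or unusual attribute quoting would need a real HTML parser.

```javascript
// Illustrative only: extracts rows and cells from simple table markup.
function extractTableRows(html) {
  const rows = [];
  // \b keeps <tr> from matching <track>, and <th> from matching <thead>.
  const rowPattern = /<tr\b[^>]*>([\s\S]*?)<\/tr>/gi;
  const cellPattern = /<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi;

  for (const rowMatch of html.matchAll(rowPattern)) {
    const cells = [];
    for (const cellMatch of rowMatch[1].matchAll(cellPattern)) {
      // Drop any inline tags inside the cell and trim surrounding whitespace.
      cells.push(cellMatch[1].replace(/<[^>]+>/g, '').trim());
    }
    if (cells.length > 0) {
      rows.push(cells);
    }
  }
  return rows;
}

// Hypothetical input for demonstration.
const tableHtml =
  '<table><thead><tr><th>Name</th><th>Language</th></tr></thead>' +
  '<tr><td><a href="#">cheerio</a></td><td> Javascript </td></tr></table>';

console.log(extractTableRows(tableHtml));
```

Each row comes back as an array of cell strings, so the header row and the data rows can be zipped into objects if named fields are preferred.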

malsub - A Python RESTful API framework for online malware analysis and threat intelligence services

  •    Python

malsub is a Python 3.6.x framework that wraps several web services of online malware and URL analysis sites through their RESTful Application Programming Interfaces (APIs). It supports submitting files or URLs for analysis, retrieving reports by hash values, domains, IPv4 addresses or URLs, downloading samples and other files, making generic searches and getting API quota values. The framework is designed in a modular way so that new services can be added with ease by following the provided template module and functions to make HTTP GET and POST requests and to pretty print results. This approach avoids having to write individual and specialized wrappers for each and every API by leveraging what they have in common in their calls and responses. The framework is also multi-threaded and dispatches service API functions across a thread pool for each input argument, meaning that it spawns a pool of threads per each file provided for submission or per each hash value provided for report retrieval, for example. Most of these services require API keys that are generated after registering an account in their respective websites, which need to be specified in the apikey.yaml file according to the given structure. Note that some of the already bundled services are limited in supported operations due to the fact that they were developed with free API keys. API keys associated with paid subscriptions are allowed to make additional calls not open to the public and may not be restricted by a given quota. Yet, malsub can process multiple input arguments and pause between requests as a workaround for cooldown periods.

node-web-scraper - Code for the tutorial: Scraping the Web With Node.js by @kukicado

  •    Javascript

Once the Node server has started, navigate to http://localhost:8081/scrape and see what happens.

facebook-page-post-scraper - Data scraper for Facebook Pages, and also code accompanying the blog post How to Scrape Data From Facebook Page Posts for Statistical Analysis

  •    Python

UPDATE December 2017: Due to a bug on Facebook's end, using this scraper will only return a very small subset of posts (5-10% of posts) over a limited timeframe. Since Facebook now owns CrowdTangle, the (paid) canonical source of historical Facebook data, Facebook doesn't have an incentive to fix the linked bug. On December 12th, a Facebook engineer commented that they are developing a new endpoint for scraping posts chronologically. I will refactor this script once that happens. Until then, there likely will not be any PRs accepted.

X-Ray Engine Toolset

An unofficial X-Ray toolset for use alongside the official S.T.A.L.K.E.R. MOD SDK. The code to load/save X-Ray files closely follows the GSC one. As for the rest of the source code, you can do whatever you want with it, just do not say you wrote it.

OpenTheatre - Search movies, series, anime, subtitles, torrents and archives from open directories

  •    CSharp

OpenTheatre is a program which allows users to search for Movies, TV Series, Anime, Subtitles, Torrents and Archives. The program communicates with its own API written entirely using our custom built command-line web crawler designed to scrape information from trusted files which are updated every day. The public web resources used are available on our open assets database, where anyone can contribute their open directories. OpenTheatre works to query movies, series, anime, subtitles, torrents and archives from all around the web to provide you with the ultimate streaming experience.

micro-open-graph - A tiny Node.js microservice to scrape open graph data with joy.

  •    Javascript

A tiny Node.js microservice to scrape open graph data with joy. The server will then be listening at localhost:3000.

web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension

  •    Javascript

Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) for how a web site should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all data. Scraped data can later be exported as CSV. When submitting a bug, please attach an exported sitemap if possible.

scraperjs - A complete and versatile web scraper.

  •    Javascript

Scraperjs is a web scraper module that makes scraping the web an easy job. Try to spot the differences.

pySpec

The pySpec project is a set of data analysis routines written in Python for analysis of x-ray diffraction data produced by the SPEC X-Ray Diffraction and Data Acquisition software.

WebApi - OData Web API: A server library built upon ODataLib and WebApi

  •    CSharp

OData Web API (i.e., ASP.NET Web API OData) is a server library built upon ODataLib and Web API. This is the active development branch for OData WebApi and it is currently the most actively iterated. The package name is Microsoft.AspNet.OData. This is the OData WebApi for ODL v7.x releases, which contain breaking changes against ODL v6.