scrapy-training - Scrapy Training companion code


This repository contains the companion files for the "Crawling the Web with Scrapy" training program. You can either clone it with git or download it as a zip archive. Contact Scrapinghub if you (or your company) are interested in Scrapy training and coaching sessions.

https://github.com/scrapinghub/scrapy-training


Related Projects

Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
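
A minimal spider sketch (the quotes.toscrape.com demo site is an illustrative choice, not taken from the description above):

```python
# Sketch: a tiny Scrapy spider that yields structured items.
# Run with: scrapy runspider quotes_spider.py -o quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```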

awesome-scrapy - A curated list of awesome packages, articles, and other cool resources from the Scrapy community


A curated list of awesome packages, articles, and other cool resources from the Scrapy community. One example entry is scrapyscript, which runs a Scrapy spider programmatically from a script or a Celery task, with no project required.

scrapy-zhihu-github - Scrapy examples for crawling Zhihu and GitHub

  •    Python

Scrapy examples for crawling Zhihu and GitHub.

scrapy-proxies - Random proxy middleware for Scrapy

  •    Python

Processes Scrapy requests using a random proxy from a list, to avoid IP bans and improve crawling speed. For Scrapy versions older than 1.0.0, you have to use the scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware middlewares instead.
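
A minimal settings.py sketch following the project's README; the proxy list path is a placeholder you would replace with your own file (one proxy URL per line):

```python
# settings.py -- enable scrapy-proxies alongside Scrapy's retry and
# proxy middlewares (middleware paths per the scrapy-proxies README).
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

PROXY_LIST = "/path/to/proxy/list.txt"  # placeholder path
PROXY_MODE = 0  # 0 = pick a new random proxy for every request
```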


ferret - Declarative web scraping

  •    Go

ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more. ferret allows users to focus on the data: it abstracts away the technical details and complexity of the underlying technologies using its own declarative language. It is extremely portable, extensible, and fast. It has the ability to scrape JS-rendered pages, handle all page events, and emulate user interactions.

portia - Visual scraping for Scrapy

  •    Python

Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages. For more detailed instructions, and alternatives to using Docker, see the Installation docs.

scrapyrt - Scrapy realtime

  •    Python

An HTTP server that provides an API for scheduling Scrapy spiders and making requests with spiders. It allows you to easily add an HTTP API to your existing Scrapy project. All Scrapy project components (e.g. middleware, pipelines, extensions) are supported out of the box. You simply run Scrapyrt in a Scrapy project directory, and it starts an HTTP server that lets you schedule your spiders and get spider output in JSON format.
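
For example, with Scrapyrt running on its default port (9080) inside a project, a crawl can be triggered with one request (the spider name "example" below is an assumption):

```python
# Sketch: trigger a crawl through Scrapyrt's crawl.json endpoint and
# read the scraped items from the JSON response.
import requests

resp = requests.get(
    "http://localhost:9080/crawl.json",
    params={"spider_name": "example", "url": "http://quotes.toscrape.com/"},
)
print(resp.json()["items"])
```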

ruia - Async Python 3.6+ web scraping micro-framework based on asyncio.

  •    Python

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.
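
Ruia's own API is not shown in this listing; as a rough illustration of the asyncio + aiohttp pattern it builds on, a minimal concurrent fetcher might look like this:

```python
# Sketch of the asyncio + aiohttp pattern underlying ruia: fetch
# several pages concurrently on a single thread (Python 3.7+).
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(urls: list) -> None:
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
        for url, html in zip(urls, pages):
            print(url, len(html))

asyncio.run(crawl(["http://quotes.toscrape.com/"]))
```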

scrapy-redis - Redis-based components for Scrapy.

  •    Python

Redis-based components for Scrapy. You can start multiple spider instances that share a single Redis queue; best suited for broad multi-domain crawls.
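
Enabling the shared queue is a matter of project settings; a minimal sketch per the scrapy-redis README (the Redis URL assumes a local instance):

```python
# settings.py -- route scheduling and duplicate filtering through
# Redis so that multiple spider processes share one queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue between runs
REDIS_URL = "redis://localhost:6379"  # assumed local Redis
```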

hotel-review-analysis - Sentiment analysis and aspect classification for hotel reviews using machine learning models with MonkeyLearn

  •    Python

This is the source code for MonkeyLearn's series of posts on analyzing sentiment and aspects of hotel reviews using machine learning models. The code runs on Python 2.7. The project itself is a Scrapy project used to gather training and testing data from sites like TripAdvisor and Booking. In addition, a series of Python scripts and Jupyter notebooks implement the necessary processing steps.

Web App Security Training Movies


Intended for developers, to highlight insecure coding practices and show how attackers can abuse these weaknesses. The training movies can be viewed directly online at the following sites: http://yehg.net/lab/#training http://core.yehg.net/lab/#training

scrapyd - A service daemon to run Scrapy spiders

  •    Python

Scrapyd is a service for running Scrapy spiders. It allows you to deploy your Scrapy projects and control their spiders using an HTTP JSON API.
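
For instance, once a project has been deployed, a run can be scheduled with a single request (the project and spider names below are assumptions):

```python
# Sketch: schedule a spider run through scrapyd's JSON API,
# assuming scrapyd is listening on its default port 6800.
import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```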

scrapy-examples - Multifarious Scrapy examples

  •    Python

Multifarious Scrapy examples with integrated proxies and user agents, which make it comfortable to write a spider. The spiders crawl several levels deep and extract real data starting from depth 2.

grab - Web Scraping Framework

  •    Python

Project Grab is not abandoned, but it is not being actively developed. The author is currently working on another crawling framework, intended to be simple, fast, and free of memory leaks; the new project is located at https://github.com/lorien/crawler. It initially used a mix of asyncio (for networking) and classic threads (for parsing HTML with lxml on multiple CPU cores), but the author then switched to classic threads for everything for the sake of simplicity. Network requests are processed with pycurl because it is fast, feature-rich, and supports SOCKS5 proxies. You can try the new framework, but be aware that it does not have many features yet; in particular, its options for configuring network requests are very limited. If you need an option, feel free to create a new issue.

scrapely - A pure-python HTML screen-scraping library

  •    HTML

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages. Scrapinghub wrote a nice blog post explaining how scrapely works and how it's used in Portia.
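
A minimal sketch adapted from scrapely's README: train the scraper on one example page, then apply it to a structurally similar page (the PyPI URLs come from that README and may have changed since):

```python
# Sketch: scrapely learns an extraction template from one example
# (url, data) pair, then applies it to similar pages.
from scrapely import Scraper

s = Scraper()
train_url = "http://pypi.python.org/pypi/w3lib/1.1"
s.train(train_url, {"name": "w3lib 1.1", "author": "Scrapy project"})

similar_url = "http://pypi.python.org/pypi/Django/1.3"
print(s.scrape(similar_url))  # list of dicts with 'name' and 'author'
```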

nauta - A multi-user, distributed computing environment for running DL model training experiments on Intel® Xeon® Scalable processor-based systems

  •    Python

The Nauta software provides a multi-user, distributed computing environment for running deep learning model training experiments. Results of experiments can be viewed and monitored using a command line interface, a web UI, and/or TensorBoard. You can use existing data sets, use your own data, or download data from online sources, and create public or private folders to make collaboration among teams easier. Nauta runs on the industry-leading Kubernetes and Docker platforms for scalability and ease of management. Template packs for various DL frameworks and tooling are available (and customizable) on the platform, taking the complexity out of creating and running single- and multi-node deep learning training experiments, without all the systems overhead and scripting needed with standard container environments.





