crawl_r - VersionEye crawlers implemented in Ruby.

  •    Ruby

This repo contains several crawlers implemented in Ruby. First, fire up the VersionEye backend services as described here.

https://www.versioneye.com
https://github.com/versioneye/crawl_r

Related Projects

frontera - A scalable frontier for web crawlers

  •    Python

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler. Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it can do so in a distributed manner.
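
Frontera itself is Python; as a language-neutral illustration of what a crawl frontier does at its core (prioritised, deduplicated scheduling of extracted links), here is a minimal sketch in Java. The class and method names are invented for this example and are not Frontera's API.

    import java.util.HashSet;
    import java.util.PriorityQueue;
    import java.util.Set;

    // Toy crawl frontier: stores extracted links with a priority,
    // deduplicates them, and hands back the next URL to visit.
    public class Frontier {
        // A candidate URL with a crawl priority (higher = visit sooner).
        record Candidate(String url, int priority) {}

        private final PriorityQueue<Candidate> queue =
                new PriorityQueue<>((a, b) -> Integer.compare(b.priority(), a.priority()));
        private final Set<String> seen = new HashSet<>();

        // Schedule a newly extracted link unless it was already seen.
        public void schedule(String url, int priority) {
            if (seen.add(url)) {
                queue.add(new Candidate(url, priority));
            }
        }

        // The page the crawler should fetch next, or null when the queue is empty.
        public String nextUrl() {
            Candidate next = queue.poll();
            return next == null ? null : next.url();
        }

        public static void main(String[] args) {
            Frontier frontier = new Frontier();
            frontier.schedule("http://example.com/", 10);
            frontier.schedule("http://example.com/about", 1);
            frontier.schedule("http://example.com/", 10); // duplicate, ignored
            System.out.println(frontier.nextUrl()); // http://example.com/
            System.out.println(frontier.nextUrl()); // http://example.com/about
        }
    }

A real frontier like Frontera's additionally persists this state and partitions it across machines, which is where the distribution/scaling primitives come in.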

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It can currently detect thousands of bots/spiders/crawlers. Run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect": "1.*" to your composer.json.
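
As an illustration of the general technique (matching the User-Agent header against known bot signatures), here is a minimal Java sketch; the pattern list is a tiny invented subset, not the package's actual signature database or API.

    import java.util.regex.Pattern;

    // Minimal bot detection by User-Agent matching. Real detectors such as
    // CrawlerDetect ship a far larger, maintained list of signatures.
    public class BotSniffer {
        private static final Pattern BOT_PATTERN = Pattern.compile(
                "googlebot|bingbot|baiduspider|yandex|slurp|crawler|spider|bot",
                Pattern.CASE_INSENSITIVE);

        public static boolean isCrawler(String userAgent) {
            return userAgent != null && BOT_PATTERN.matcher(userAgent).find();
        }

        public static void main(String[] args) {
            System.out.println(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)")); // true
            System.out.println(isCrawler("Mozilla/5.0 (Windows NT 10.0) Firefox/89.0")); // false
        }
    }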

crawlers - Some crawlers, you know it :-)

  •    Python

Some crawlers, you know it :-)

Dungeon Crawl Reference

  •    Lua

Dungeon Crawl Stone Soup is a free rogue-like game of exploration and treasure-hunting. Stone Soup is a continuation of Linley's Dungeon Crawl. It is openly developed and invites participation from the Crawl community. See http://crawl.develz.org!

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward: often, all you'll have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology, as sketched below), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce.
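
As a rough sketch of what such a Topology class can look like, here is a minimal Java example extending ConfigurableTopology that wires a seed spout into a fetcher and an HTML parser. The package, class, and constructor names follow the StormCrawler Maven archetype as far as I recall them; treat the exact imports and component set as assumptions to verify against the project documentation.

    import com.digitalpebble.stormcrawler.ConfigurableTopology;
    import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.spout.MemorySpout;

    import org.apache.storm.topology.TopologyBuilder;

    // Minimal crawl topology: seed URLs -> fetcher -> HTML parser.
    // ConfigurableTopology handles config loading and topology submission.
    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) throws Exception {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            // In-memory spout with a hard-coded seed; real crawls would read
            // seeds from a file or a status index.
            builder.setSpout("spout", new MemorySpout(new String[] { "http://example.com/" }));
            builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("spout");
            builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("fetch");
            return submit("crawl", conf, builder);
        }
    }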

Crawl

  •    C++

This is a rewrite of Linley Henzell's game Crawl in C++. Crawl is a rogue-like similar to Moria, Angband, and NetHack.

crawler - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

  •    PHP

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, Chrome and Puppeteer are used to power this feature.
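
The package is PHP, but the promise-style concurrency it describes translates directly to other languages. Here is a hedged Java analogue that crawls multiple URLs concurrently with CompletableFuture and the JDK 11 HttpClient; the URLs are placeholders, and this is an illustration of the idea rather than the package's own API.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // Fetch several URLs concurrently, the same idea as crawling multiple
    // URLs at once via Guzzle promises.
    public class ConcurrentFetch {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            List<String> urls = List.of("http://example.com/", "http://example.org/");

            List<CompletableFuture<Void>> futures = urls.stream()
                    .map(url -> client.sendAsync(
                                    HttpRequest.newBuilder(URI.create(url)).build(),
                                    HttpResponse.BodyHandlers.ofString())
                            .thenAccept(res -> System.out.println(
                                    res.uri() + " -> " + res.statusCode())))
                    .toList();

            // Wait for all in-flight requests to complete.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        }
    }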

crawl - Dungeon Crawl: Stone Soup official repository

  •    C++

Dungeon Crawl Stone Soup is a game of dungeon exploration, combat and magic, involving characters of diverse skills, worshipping deities of great power and caprice. To win, you'll need to be a master of tactics and strategy, and prevail against overwhelming odds. There is also an in-game list of frequently asked questions which you can access by typing ?Q.

STSADM ExportCrawlLog

  •    

ExportCrawlLog is an STSADM command extension that provides the ability to export Crawl Log messages and gather summary information about crawls based on the information in the crawl log.

Sharepoint Shared Services Search Provider Property Creation

  •    

A neat little command utility that lets you do four things when moving a DB from development to production:
- Accept relevant inputs from the user
- Export Managed Properties
- Export Crawl Properties and relevant categories
- Import Managed Properties and map relevant crawl ...

WebCrawler and Entity Extraction using Fetch and process frame work

  •    

Web Crawler using the Fetch and Process Framework. Yes, it does process robots.txt.

fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

  •    Go

Package fetchbot provides a simple and flexible web crawler that follows robots.txt policies and crawl delays. The package has a single external dependency, robotstxt. It also integrates code from the iq package.
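
fetchbot is written in Go; as an illustration of the crawl-delay bookkeeping such a crawler automates, here is a minimal Java sketch of per-host politeness. Parsing robots.txt is omitted, and the fixed delay is an assumption (a real crawler reads the delay per host from robots.txt).

    import java.time.Duration;
    import java.time.Instant;
    import java.util.HashMap;
    import java.util.Map;

    // Per-host politeness gate: before fetching, wait until the host's
    // crawl delay has elapsed since our last request to it.
    public class PolitenessGate {
        private final Duration crawlDelay;
        private final Map<String, Instant> lastRequest = new HashMap<>();

        public PolitenessGate(Duration crawlDelay) {
            this.crawlDelay = crawlDelay;
        }

        // Simplistic: holds the lock while sleeping, so it serialises all
        // hosts; fine for a sketch, not for a production crawler.
        public synchronized void waitTurn(String host) throws InterruptedException {
            Instant last = lastRequest.get(host);
            if (last != null) {
                long sleepMs = Duration.between(Instant.now(), last.plus(crawlDelay)).toMillis();
                if (sleepMs > 0) {
                    Thread.sleep(sleepMs);
                }
            }
            lastRequest.put(host, Instant.now());
        }
    }

A crawler would call waitTurn(new URI(url).getHost()) before each request so that hits against the same host are spaced out by at least the crawl delay.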

node-readability - Scrape/Crawl article from any site automatically

  •    JavaScript

In my case, the spider handles about 1500k documents per day; the maximum crawling speed is 1.2k/minute (1k/minute on average), the memory cost is about 200 MB on each spider kernel, and the accuracy is about 90%. The remaining 10% can be fixed by customizing Score Rules or Selectors. It's better than any other readability module.

yacy_grid_crawler - Crawler Microservice for the YaCy Grid

  •    Java

The Crawler is a microservice which can be deployed, e.g., using Docker. When the Crawler component is started, it searches for an MCP and connects to it. By default the local host is searched for an MCP, but you can configure one yourself. Every loader and parser microservice must read this crawl profile information. Because that information is required many times, we avoid a request into the crawler index by adding the crawler profile to each contract of a crawl job in the crawler_pending and loader_pending queues.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy, preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. It provides a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

pyrailgun - Simple And Easy Python Crawl Framework, a simple, practical, and efficient Python web crawler module that supports scraping JavaScript-rendered pages

  •    Python

Simple And Easy Python Crawl Framework, a simple, practical, and efficient Python web crawler module that supports scraping JavaScript-rendered pages.

SQL-Server-R-Services-Samples - Advanced analytics samples and templates using SQL Server R Services

  •    R

In these examples, we will demonstrate how to develop and deploy end-to-end advanced analytics solutions with SQL Server 2016 R Services. Develop models in an R IDE: SQL Server 2016 R Services allows data scientists to develop solutions in an R IDE (such as RStudio or Visual Studio R Tools) with open source R or Microsoft R Server, using data residing in SQL Server, with computation done in-database.

swirl - :cyclone: Learn R, in R.

  •    R

swirl is a platform for learning (and teaching) statistics and R simultaneously and interactively. It presents a choice of course lessons and interactively tutors a student through them. A student may be asked to watch a video, to answer a multiple-choice or fill-in-the-blanks question, or to enter a command in the R console precisely as if he or she were using R in practice. Emphasis is on the last, interacting with the R console. User responses are tested for correctness and hints are given if appropriate. Progress is automatically saved so that a user may quit at any time and later resume without losing work. swirl leans heavily on exercising a student's use of the R console. A callback mechanism, suggested and first demonstrated for the purpose by Hadley Wickham, is used to capture student input and to provide immediate feedback relevant to the course material at hand.

lubridate - Make working with dates in R just that little bit easier

  •    R

Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight saving time, and other time-related quirks, and R lacks these capabilities in some situations. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not. If you are new to lubridate, the best place to start is the dates and times chapter in R for Data Science.

knitr - A general-purpose tool for dynamic report generation in R

  •    R

The R package knitr is a general-purpose literate programming engine, with lightweight APIs designed to give users full control of the output without heavy coding work. It combines many features into one package, with slight tweaks motivated by my everyday use of Sweave. See the package homepage for details and examples. See the FAQs for a list of frequently asked questions (including where to ask questions). Note that if you want to build the source package via R CMD build without a previously installed version of knitr, you must either pre-install knitr from CRAN or run R CMD INSTALL on this source repo; otherwise R CMD build will fail (which is probably a bug of base R).