gocrawl - Polite, slim and concurrent web crawler.


gocrawl is a polite, slim and concurrent web crawler written in Go. For a simpler yet more flexible web crawler written in a more idiomatic Go style, you may want to take a look at fetchbot, a package that builds on the experience of gocrawl.

https://github.com/PuerkitoBio/gocrawl
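As a quick illustration, here is a minimal crawler sketch based on the Extender pattern shown in gocrawl's README. The Visit/Filter signatures and option names below reflect one version of the package and may differ in others, and the target URL and regexp are just placeholders:

    package main

    import (
        "net/http"
        "regexp"
        "time"

        "github.com/PuerkitoBio/gocrawl"
        "github.com/PuerkitoBio/goquery"
    )

    // Only enqueue the root page and paths starting with "/a".
    var rxOk = regexp.MustCompile(`https://duckduckgo\.com(/a)?$`)

    type ExampleExtender struct {
        gocrawl.DefaultExtender // default implementation of everything but Visit and Filter
    }

    // Visit is called for each crawled page; returning (nil, true)
    // lets gocrawl discover and enqueue the page's links itself.
    func (x *ExampleExtender) Visit(ctx *gocrawl.URLContext, res *http.Response, doc *goquery.Document) (interface{}, bool) {
        // Use the goquery document or res.Body to extract data here.
        return nil, true
    }

    // Filter decides whether a discovered URL should be enqueued.
    func (x *ExampleExtender) Filter(ctx *gocrawl.URLContext, isVisited bool) bool {
        return !isVisited && rxOk.MatchString(ctx.NormalizedURL().String())
    }

    func main() {
        // Politeness knobs: per-host crawl delay and a visit cap.
        opts := gocrawl.NewOptions(new(ExampleExtender))
        opts.CrawlDelay = 1 * time.Second
        opts.LogFlags = gocrawl.LogAll
        opts.MaxVisits = 2

        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run("https://duckduckgo.com/")
    }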

Related Projects

fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

  •    Go

Package fetchbot provides a simple and flexible web crawler that follows the robots.txt policies and crawl delays. The package has a single external dependency, robotstxt. It also integrates code from the iq package.
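In the same spirit, here is a minimal fetchbot sketch adapted from that package's README. The seed URLs are placeholders, and the method names reflect the package's documented API at one point in time; verify against the current version before relying on it:

    package main

    import (
        "fmt"
        "net/http"

        "github.com/PuerkitoBio/fetchbot"
    )

    // handler is invoked for every completed request (or error).
    func handler(ctx *fetchbot.Context, res *http.Response, err error) {
        if err != nil {
            fmt.Printf("error: %s\n", err)
            return
        }
        fmt.Printf("[%d] %s %s\n", res.StatusCode, ctx.Cmd.Method(), ctx.Cmd.URL())
    }

    func main() {
        f := fetchbot.New(fetchbot.HandlerFunc(handler))
        queue := f.Start()
        // Enqueue a few HEAD requests; Close waits for pending commands.
        queue.SendStringHead("http://golang.org", "http://golang.org/doc")
        queue.Close()
    }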

Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.
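For context, the robots.txt exclusion directives and META robots tags that such a crawler honors look like the following. This is a generic, hypothetical example of the standard formats, not Heritrix-specific configuration:

    # robots.txt at the site root
    User-agent: *
    Disallow: /private/
    Crawl-delay: 10

    <!-- META robots tag in a page's <head> -->
    <meta name="robots" content="noindex, nofollow">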

agora - a dynamically typed, garbage collected, embeddable programming language built with Go

  •    Go

Agora is a dynamically typed, garbage collected, embeddable programming language. It is built with the Go programming language, and is meant to provide a syntactically similar, loose and dynamic companion to the statically typed, machine compiled Go language - somewhat like Lua is to C. Install with:

    go get -t github.com/PuerkitoBio/agora/...

WCF Data Service Format Extensions for CSV, TXT

  •    

This project adds support for legacy formats like CSV and TXT (CSV export) to the data service output, and allows the $format=txt query. By default, WCF Data Services supports Atom and JSON responses; however, legacy systems do not understand Atom or JSON but they understand CSV, TXT f...


James - Enterprise Mail Server

  •    Java

James (a.k.a. Java Apache Mail Enterprise Server) is a 100% pure Java SMTP and POP3 mail server and NNTP news server, designed to be a complete and portable enterprise mail/messaging engine solution based on currently available open messaging protocols.

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework that covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence. It can simplify the development of a specific crawler.

C Steganography

  •    Ruby

This software is a set of tools that hides C source code in txt files; it can also restore the txt files to C source code again. This work is based on the tool c2txt2c by Leevi Martilla. Csteg needs a book file in txt format to hide C so

Cmic Reader

  •    

Cmic Reader is a reader that allows you to download txt files from SkyDrive and read them on Windows Phone 7. It supports reading txt files in English and East Asian languages, for example Chinese or Japanese, in UTF-8 encoding.

django-robots - A Django app for managing robots.txt files following the robots exclusion protocol

  •    Python

A Django app for managing robots.txt files following the robots exclusion protocol

Cycle - CYbot Control LanguagE

  •    C

The CYbot Control LanguagE (or Cycle for short) is a Java-like language for programming Ultimate Real Robots' Cybot and TOM robots, with an open source compiler that produces files that can be loaded into the Real Robots software for testing and downloadi

django-robots - A Django app for managing robots.txt files following the robots exclusion protocol

  •    Python

This is a basic Django application to manage robots.txt files following the robots exclusion protocol, complementing the Django Sitemap contrib app.

StockSharp - Algorithmic trading and quantitative trading open source platform to develop trading robots (stock markets, forex, bitcoins and options)

  •    CSharp

StockSharp (S# for short) is a free set of programs for trading on any market in the world (American, European, Asian, Russian; stocks, futures, options, Bitcoins, forex, etc.). You can trade manually or run automated trading (algorithmic trading robots, conventional or HFT). Available connections: FIX/FAST, LMAX, Rithmic, Fusion/Blackwood, Interactive Brokers, OpenECry, Sterling, IQFeed, ITCH, FXCM, QuantHouse, E*Trade, BTCE, BitStamp and many others. Works with any broker or partner broker.

Zbots - Battling breeding robots

  •    C++

A robot wars variant. A computer game in which robots are governed by a simple assembly language. Various arenas running on different hosts can be connected over the internet, so that robots blundering into transporters can be sent from machine to machi

sphero.js - The Sphero JavaScript SDK to control Sphero robots.

  •    Javascript

The official Orbotix JavaScript SDK module to programmatically control Sphero robots. The BB-8 and Ollie use a Bluetooth Low Energy (LE) interface, also known as "Bluetooth Smart" or "Bluetooth 4.0/4.1". You must have a hardware adapter that supports the Bluetooth 4.x+ standard to connect your computer to your BB-8 or Ollie.

Introduction-to-Autonomous-Robots - Introduction to Autonomous Robots

  •    Mathematica

Nikolaus Correll. Introduction to Autonomous Robots, 2nd edition, Magellan Scientific, 2016.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make Enterprise Search integrators' and developers' lives easier. It is portable, extensible and reusable, supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and a lot more.

yacy_grid_crawler - Crawler Microservice for the YaCy Grid

  •    Java

The Crawler is a microservice which can be deployed, for example, using Docker. When the Crawler component is started, it searches for an MCP and connects to it. By default the local host is searched for an MCP, but you can configure one yourself. Every loader and parser microservice must read this crawl profile information. Because that information is required many times, we omit a request into the crawler index by adding the crawler profile into each contract of a crawl job in the crawler_pending and loader_pending queues.

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It can currently detect thousands of bots/spiders/crawlers. Run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect": "1.*" to your composer.json.
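The composer.json variant of the install instruction above is a minimal require entry, sketched here:

    {
        "require": {
            "jaybizzle/crawler-detect": "1.*"
        }
    }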