pholcus - [Crawler for Golang] Pholcus is a distributed, high concurrency and powerful web crawler software

  •        55

[Crawler for Golang] Pholcus is a distributed, high concurrency and powerful web crawler software.

http://pholcus.gitbooks.io/docs
https://github.com/henrylee2cn/pholcus

Tags
Implementation
License
Platform

   




Related Projects

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ. Below is a very high level architecture diagram demonstrating how Crawler works.

Open Search Server

  •    C++

Open Search Server is both a modern crawler and search engine and a suite of high-powered full text search algorithms. Built using the best open source technologies like lucene, zkoss, tomcat, poi, tagsoup. Open Search Server is a stable, high-performance piece of software.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler that aims to make Enterprise Search integrators and developers's life easier. It is Portable, Extensible, reusable, Robots.txt support, Obtain and manipulate document metadata, Resumable upon failure and lot more.

Squzer - Distributed Web Crawler

  •    Python

Squzer is the Declum's open-source, extensible, scale, multithreaded and quality web crawler project entirely written in the Python language.

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. StormCrawler is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward. Often, all you'll have to do will be to declare StormCrawler as a Maven dependency, write your own Topology class (tip : you can extend ConfigurableTopology), reuse the components provided by the project and maybe write a couple of custom ones for your own secret sauce.


node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery ;-)

  •    Javascript

Web Crawler/Spider for NodeJS + server-side jQuery ;-)

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repositoriy of your choice (e.g. a search engine). It very flexible, powerful, easy to extend, and portable.

colly - Elegant Scraper and Crawler Framework for Golang

  •    Go

Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. Currently able to detect 1,000's of bots/spiders/crawlers. Run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect" :"1.*" to your composer.json.

crawler - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

  •    PHP

This package provides a class to crawl links on a website. Under the hood Guzzle promises are used to crawl multiple urls concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript rendered sites. Under the hood Chrome and Puppeteer are used to power this feature.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

Gigablast - Web and Enterprise search engine in C++

  •    C++

Gigablast is one of the remaining four search engines in the United States that maintains its own searchable index of over a billion pages. It is scalable to thousands of servers. Has scaled to over 12 billion web pages on over 200 servers. It supports Distributed web crawler, Document conversion, Automated data corruption detection and repair, Can cluster results from same site, Synonym search, Spell checker and lot more.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simply the development of a specific crawler.

go_spider - [爬虫框架 (golang)] An awesome Go concurrent Crawler(spider) framework

  •    Go

A crawler of vertical communities achieved by GOLANG. Latest stable Release: Version 1.2 (Sep 23, 2014).

Ex-Crawler

  •    Java

Ex-Crawler is divided into 3 subprojects (Crawler Daemon, distributed gui Client, (web) search engine) which together provide a flexible and powerful search engine supporting distributed computing. More informations: http://ex-crawler.sourceforge.net

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, it sometimes ends up capturing empty bodies, especially when the websites are built on such modern frontend frameworks as AngularJS, React and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.

Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Arachnode.net

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

gocrawl - Polite, slim and concurrent web crawler.

  •    Go

gocrawl is a polite, slim and concurrent web crawler written in Go.For a simpler yet more flexible web crawler written in a more idiomatic Go style, you may want to take a look at fetchbot, a package that builds on the experience of gocrawl.