slinky - web crawler just for links


Slinky is a web crawler, but just for the links between webpages. Slinky is intended to be used to visualize the routes and structure behind a website by collecting hyperlinks. If you decide to print out the source code and drop it down a flight of stairs, you may not be disappointed either.

https://github.com/andrejewski/slinky

Dependencies:

defaults : ^1.0.0
superagent : ^0.18.2
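
A rough illustration of the links-only idea described above - fetch a page and keep just its outgoing href edges - might look like the following Node sketch. This is not slinky's actual API; the fetchPage and extractLinks helpers are hypothetical names for this example, and a real crawler would use a proper HTML parser instead of a regex.

    // Fetch a page and list its outgoing links (illustration only, not slinky's API).
    const https = require('https');

    function fetchPage(url) {
      return new Promise((resolve, reject) => {
        https.get(url, (res) => {
          let body = '';
          res.on('data', (chunk) => { body += chunk; });
          res.on('end', () => resolve(body));
        }).on('error', reject);
      });
    }

    function extractLinks(html) {
      // Crude href extraction for illustration; a real crawler should parse the HTML.
      const links = [];
      const re = /href="([^"]+)"/g;
      let match;
      while ((match = re.exec(html)) !== null) {
        links.push(match[1]);
      }
      return links;
    }

    // Print the source -> target edges of a single page.
    fetchPage('https://example.com').then((html) => {
      for (const target of extractLinks(html)) {
        console.log('https://example.com ->', target);
      }
    });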

Related Projects

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

dcrawl - Simple, but smart, multi-threaded web crawler for randomly gathering huge lists of unique domain names

  •    Go

dcrawl is a simple, but smart, multi-threaded web crawler for randomly gathering huge lists of unique domain names. dcrawl takes one site URL as input and detects all <a href=...> links in the site's body. Each found link is put into the queue. Successively, each queued link is crawled in the same way, branching out to more URLs found in links on each site's body.
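
The queue-driven branching dcrawl describes (seed URL in, links out, each link queued and crawled the same way) is language-agnostic. Purely as an illustration - dcrawl itself is written in Go - a breadth-first loop in Node could look like the sketch below, reusing the hypothetical fetchPage and extractLinks helpers from the slinky sketch above.

    // Breadth-first crawl loop in the spirit of dcrawl's description (not its Go code).
    async function crawl(seedUrl, maxPages = 50) {
      const queue = [seedUrl];          // URLs waiting to be crawled
      const seen = new Set([seedUrl]);  // avoid visiting the same URL twice
      const domains = new Set();        // unique domain names gathered so far
      let visited = 0;

      while (queue.length > 0 && visited < maxPages) {
        const url = queue.shift();
        visited += 1;
        domains.add(new URL(url).hostname);

        let html;
        try {
          html = await fetchPage(url);
        } catch (err) {
          continue; // skip pages that fail to load
        }

        for (const link of extractLinks(html)) {
          let absolute;
          try {
            absolute = new URL(link, url).href; // resolve relative links against the page
          } catch (err) {
            continue; // ignore malformed hrefs
          }
          if (!seen.has(absolute)) {
            seen.add(absolute);
            queue.push(absolute); // each found link is queued and crawled the same way
          }
        }
      }
      return domains;
    }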

Pavuk

  •    C

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.


Mvc Xml Sitemap


MVC Sitemap makes it a snap for your ASP.NET MVC-based web site to expose a sitemap XML file to search engine crawlers. Simply place a [Sitemap] attribute on all Actions you want crawled and create an action for the sitemap - it's that easy.

link - A repository for the PSR-13 [Hyperlink] interface

  •    PHP

This repository holds all interfaces/classes/traits related to PSR-13. Note that this is not an HTTP link implementation of its own. It is merely an interface that describes an HTTP link. See the specification for more details.

web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension

  •    Javascript

Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) for how a web site should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all the data. Scraped data can later be exported as CSV. When submitting a bug, please attach an exported sitemap if possible.

broken-link-checker - Find broken links, missing images, etc in your HTML.

  •    Javascript

Find broken links, missing images, etc. in your HTML. Node.js >= 0.10 is required; versions below 4.0 need Promise and Object.assign polyfills.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make the life of Enterprise Search integrators and developers easier. It is portable, extensible, and reusable, supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and more.

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is a library and collection of resources that developers can leverage to build their own crawlers, and doing so can be pretty straightforward. Often, all you have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce.

gocrawl - Polite, slim and concurrent web crawler.

  •    Go

gocrawl is a polite, slim and concurrent web crawler written in Go. For a simpler yet more flexible web crawler written in a more idiomatic Go style, you may want to take a look at fetchbot, a package that builds on the experience of gocrawl.

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

SimpleSiteMenu - A nested SiteMap UL list

  •    ASPNET

A simple nested UL/LI-emitting composite web control that you can bind to a SiteMap provider. I have provided some basic CSS and jQuery scripts to style it into a tree view. The code has been derived from this sample: http://bryantlikes.com/archive/2006/02/17/4839.aspx

Arachnode.net

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

sitemap-php - Library for generating Google sitemap XML files

  •    PHP

For the 90's people, I'm keeping this repository PHP 5.2 compatible. If you need a PSR-0 and Composer compatible version, here is a fork maintained by Evert Pot. Include the Sitemap.php file in your PHP document and call the Sitemap class with your base domain.

webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler.

sitemap.js - Sitemap-generating framework for node.js

  •    Javascript

sitemap.js is a high-level sitemap-generating framework that makes creating sitemap XML files easy. For video sitemap entries, the required fields are thumbnail_loc, title, and description.
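
As a hedged sketch of what building a sitemap with this library can look like, here is the older createSitemap-style usage; exact function names vary between sitemap.js versions, so treat the calls below as an assumption.

    // Minimal sitemap generation sketch; createSitemap/toString follow the older
    // sitemap.js API and may differ in newer versions of the library.
    const sm = require('sitemap');

    const sitemap = sm.createSitemap({
      hostname: 'https://example.com',
      urls: [
        { url: '/', changefreq: 'daily', priority: 1.0 },
        { url: '/about/', changefreq: 'monthly', priority: 0.5 }
      ]
    });

    console.log(sitemap.toString()); // serialized <urlset> XML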

sitemap - Sitemap and sitemap index builder

  •    PHP

Sitemap and sitemap index builder. After installing it with Composer, make sure your application autoloads Composer classes by including vendor/autoload.php.

node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery ;-)

  •    Javascript

Web Crawler/Spider for NodeJS + server-side jQuery ;-)
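
The "server-side jQuery" refers to the jQuery-like selector the crawler exposes on each fetched page. A minimal usage sketch, assuming the commonly documented Crawler/queue API (details may differ between versions):

    // Crawl a page and list its links using the jQuery-like selector on res.$.
    const Crawler = require('crawler');

    const c = new Crawler({
      maxConnections: 5,
      callback: (error, res, done) => {
        if (!error) {
          const $ = res.$; // jQuery-like selector bound to the fetched page
          $('a').each((i, el) => {
            console.log($(el).attr('href'));
          });
        }
        done(); // signal that this crawl task is finished
      }
    });

    c.queue('https://example.com');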