gain - Web crawling framework based on asyncio.

  •    Python

Web crawling framework for everyone, written with asyncio, uvloop and aiohttp. Spiders can be configured with a proxy setting.

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

ASPseek

  •    C++

ASPseek is Internet search engine software developed by SWsoft. ASPseek consists of an indexing robot, a search daemon, and a CGI search frontend. It can index as many as a few million URLs and search for words and phrases, use wildcards, and do Boolean searches. Search results can be limited to a given time period, site or Web space (set of sites) and sorted by relevance (PageRank is used) or by date.
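The word, wildcard and Boolean search described above can be illustrated with a small inverted index. This is only a sketch of the general technique, not ASPseek's implementation: the function names (`build_index`, `lookup`, `search_and`, `search_or`) and the sample documents are hypothetical, and phrase search, time or site limits, and PageRank ordering are left out.

```python
import fnmatch
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, pattern):
    """Return URLs containing any indexed word that matches a wildcard
    pattern such as 'crawl*' (shell-style wildcards via fnmatch)."""
    hits = set()
    for word in fnmatch.filter(index.keys(), pattern.lower()):
        hits |= index[word]
    return hits


def search_and(index, *patterns):
    """Boolean AND: URLs that match every pattern."""
    sets = [lookup(index, p) for p in patterns]
    return set.intersection(*sets) if sets else set()


def search_or(index, *patterns):
    """Boolean OR: URLs that match at least one pattern."""
    return set().union(*(lookup(index, p) for p in patterns))


# Hypothetical sample data, for illustration only.
docs = {
    "a": "web crawler framework",
    "b": "web search engine",
    "c": "dungeon crawling game",
}
index = build_index(docs)
and_hits = search_and(index, "web", "crawl*")
or_hits = search_or(index, "crawl*", "engine")
```

Here `and_hits` contains only the document that has both "web" and a word starting with "crawl", while `or_hits` contains every document matching either pattern.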




Pavuk

  •    C

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

mnoGoSearch

  •    C

mnoGoSearch for UNIX consists of a command-line indexer and a search program that can be run under the Apache Web Server or any other HTTP server supporting the CGI interface. mnoGoSearch for UNIX is distributed as source code and can be compiled with a number of databases, depending on the user's choice. It is known to work on a wide variety of modern UNIX operating systems, including Linux, FreeBSD, Mac OS X, Solaris and others.

FileMasta - Search servers for video, music, books, software, games, subtitles and much more

  •    CSharp

FileMasta is a search engine that lets you find a file among millions of files located on FTP servers. The search engine database contains regularly updated information on the contents of thousands of FTP servers worldwide. We don't search the contents of the files. We host no content; we only provide access to already available files, in the same way Google and other search engines do.

NewPipeExtractor - Core part of NewPipe

  •    Java

NewPipe Extractor is a library for extracting things from streaming sites. It is a core component of NewPipe, but could be used independently. NewPipe Extractor is available at JitPack's Maven repo.


freshonions-torscraper - Fresh Onions is an open source TOR spider / hidden service onion crawler hosted at zlal32teyptf4tvi

  •    Python

This is a copy of the source for the http://zlal32teyptf4tvi.onion hidden service, which implements a Tor hidden service crawler / spider and web site. This software is made available under the GNU Affero GPL 3 License. This means that if you deploy this software as part of networked software that is available to the public, you must make the source code (and any modifications) available.

sandcrawler - sandcrawler.js - the server-side scraping companion.

  •    Javascript

sandcrawler.js is a node library aiming at providing developers with concise but exhaustive tools to scrape the web. Disclaimer: this library is an unreleased work in progress.

rovers - Rovers is a service to retrieve repository URLs from multiple repository hosting providers.

  •    HTML

rovers is a service to retrieve repository URLs from multiple repository hosting providers. Install docker-compose.

andvaranaut - The dungeon crawler

  •    C

See the Makefile for instructions on how to build for Linux, macOS, and Windows. Item art by Platino.

sentry - Parallelized web crawler written in Golang

  •    Go

Sentry is a parallelized web crawler written in Go that writes URLs, links, and response headers to a Postgres database, then stores the response itself on Amazon S3. It keeps a list of “sources”, which use simple string comparison to keep it from wandering outside of designated domains or URL paths. The big difference from other crawlers is a tunable “stale duration”, which tells the crawler to capture an updated snapshot of a page if the time since the last GET request exceeds the stale duration. This gives it a continual “watching” property.
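The "sources" and "stale duration" rules described above can be sketched as a single fetch decision. This is a minimal illustration under stated assumptions, not Sentry's actual code (which is written in Go): the function names, the prefix-based reading of "simple string comparison", and the example URLs are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def in_sources(url: str, sources: List[str]) -> bool:
    """Assumed interpretation of 'simple string comparison':
    the URL must start with one of the allowed source prefixes."""
    return any(url.startswith(prefix) for prefix in sources)


def is_stale(last_get: Optional[datetime],
             stale_duration: timedelta,
             now: datetime) -> bool:
    """A page is due for a new snapshot if it was never fetched,
    or if the last GET is older than the stale duration."""
    return last_get is None or now - last_get > stale_duration


def should_fetch(url: str,
                 sources: List[str],
                 last_get: Optional[datetime],
                 stale_duration: timedelta,
                 now: datetime) -> bool:
    """Fetch only in-scope URLs whose snapshot is missing or stale."""
    return in_sources(url, sources) and is_stale(last_get, stale_duration, now)


# Hypothetical values, for illustration only.
sources = ["https://example.com/docs/"]
stale = timedelta(hours=24)
now = datetime(2024, 1, 3, tzinfo=timezone.utc)

fresh_page = should_fetch("https://example.com/docs/a", sources,
                          now - timedelta(hours=1), stale, now)
old_page = should_fetch("https://example.com/docs/a", sources,
                        now - timedelta(hours=48), stale, now)
out_of_scope = should_fetch("https://other.example.org/", sources,
                            None, stale, now)
```

Re-running this check on a schedule for every known URL is what would give a crawler the "watching" behaviour: pages are revisited once their last snapshot ages past the stale duration.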

crawler - Nodejs crawler for cnbeta.com

  •    Javascript

Node.js crawler for cnbeta.com. The source code is on GitHub.

USTBCrawlers - Those years I spent crawling USTB. A targeted-crawler tutorial that goes from basic to advanced.

  •    Python

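Those years I spent crawling USTB (University of Science and Technology Beijing). A targeted-crawler tutorial that goes from basic to advanced.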

od-database-crawler - OD-Database Go crawler

  •    Go

Here are the most important config flags. For more fine control, take a look at /config.yml.

crawler - Libraries and scripts for crawling the TYPO3 page tree

  •    PHP

Libraries and scripts for crawling the TYPO3 page tree. Used for re-caching, re-indexing, publishing applications etc. Please see the Wiki Pages for Release notes and Known issues.

XSRFProbe - A CSRF Scanner Equipped with Powerful Crawling Engine and Intelligent Token Generator.

  •    Python

XSRF Probe is an advanced Cross Site Request Forgery Audit Toolkit equipped with Powerful Crawling and Intelligent Token Generation Capabilities. Because this tool is designed to perform all kinds of form submissions automatically, it can sabotage the site: you may corrupt the database and will most probably cause a DoS on the site as well.