
Gigablast - Web and Enterprise search engine in C++

  •    C++

Gigablast is one of the remaining four search engines in the United States that maintain their own searchable index of over a billion pages. It is scalable to thousands of servers and has scaled to over 12 billion web pages on more than 200 servers. Features include a distributed web crawler, document conversion, automated data corruption detection and repair, clustering of results from the same site, synonym search, a spell checker, and much more.

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It can currently detect thousands of bots/spiders/crawlers. To install, run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect": "1.*" to your composer.json.
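
The technique itself is simple to picture: match the request's User-Agent header against a list of known bot patterns. Below is a minimal TypeScript sketch of that idea; the pattern list and the isCrawler name are illustrative assumptions for demonstration, not CrawlerDetect's PHP API.

```typescript
// Illustrative sketch of user-agent-based bot detection, the technique
// CrawlerDetect implements in PHP. The patterns and function name here
// are assumptions for demonstration, not CrawlerDetect's actual API.
const BOT_PATTERNS: RegExp[] = [
  /googlebot/i,
  /bingbot/i,
  /crawl|spider|scrape/i,
];

function isCrawler(userAgent: string | undefined): boolean {
  if (!userAgent) return false;
  return BOT_PATTERNS.some((pattern) => pattern.test(userAgent));
}

// Example checks against typical User-Agent strings.
console.log(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)")); // true
console.log(isCrawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")); // false
```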

ASPseek

  •    C++

ASPseek is Internet search engine software developed by SWsoft. ASPseek consists of an indexing robot, a search daemon, and a CGI search frontend. It can index as many as a few million URLs and search for words and phrases, use wildcards, and do Boolean searches. Search results can be limited to a given time period, site, or Web space (a set of sites), and sorted by relevance (PageRank is used) or by date.

Pavuk

  •    C

Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ. The project README includes a high-level architecture diagram showing how Crawler works.

huntsman - Super configurable async web spider

  •    Javascript

Huntsman takes one or more 'seed' URLs via the spider.queue.add() method. Once the process is kicked off with spider.start(), it takes care of extracting links from each page and following only the pages you want, as sketched below.
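
A minimal sketch of that flow follows. Only spider.queue.add() and spider.start() are named above; the export shape and the event name are assumptions, not Huntsman's documented API.

```typescript
// A sketch, not huntsman's documented API: spider.queue.add() and
// spider.start() come from the description above; everything else
// here is an assumption.
declare function require(name: string): any; // CommonJS require, for plain TS

const huntsman = require("huntsman");
const spider = huntsman.spider(); // assumed factory for a spider instance

// Hypothetical event: observe each page as it is fetched.
spider.on("navigated", (url: string) => console.log("fetched", url));

// Seed the crawl with one or more starting URLs, then kick it off.
spider.queue.add("https://example.com/");
spider.start();
```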

glyphhanger - Your web font utility belt

  •    Javascript

Your web font utility belt. It shows what unicode-ranges are used on a web site (optionally for a font-family or for each font-family). It can also subset web fonts. It makes julienne fries. Available on npm.

BlackWidow - A Python based web application scanner to gather OSINT and fuzz for OWASP vulnerabilities on a target website

  •    Python

BlackWidow is a Python-based web application spider that gathers subdomains, URLs, dynamic parameters, email addresses, and phone numbers from a target website. The project also includes the Inject-X fuzzer to scan dynamic URLs for common OWASP vulnerabilities. This software is released under the GNU General Public License v3.0; see LICENSE.md for details.

AlipayOrdersSupervisor - :sparkles: Use Node to monitor Alipay orders and notify your server in real time, implementing a signing-free payment interface

  •    Javascript

A script implementing a signing-free Alipay payment interface (NodeJS version). Alipay has since tightened its login verification, which greatly reduces the tool's convenience, so an alternative solution has been published: see the write-up on implementing personal payment collection with Youzan Cloud and Youzan Weixiaodian for a reference approach. You can apply the method this repository uses directly in your own system, or run the repository as a standalone service.

tumblr_spider - A multithreaded Python crawler for Tumblr

  •    Python

A multithreaded Python crawler for Tumblr.

VeryCD WebSpider - A plugin for InfoVista.NET

  •    

VeryCD WebSpider is a web-spider application that fetches eMule content information from www.verycd.com and stores the results in Access (.mdb) format. It was developed under VS2005 and also works as a content-provider plugin for InfoVista.NET.

Squzer - Distributed Web Crawler

  •    Python

Squzer is Declum's open-source, extensible, scalable, multithreaded web crawler, written entirely in Python.

NWebCrawler

  •    C#

This is a web crawler program written in C#.

Spider.NET

  •    ASPNET

Crawls websites and stores the content in an MS SQL or Access database, which can then be indexed and queried to power a website search.

node-readability - Scrape/Crawl article from any site automatically

  •    Javascript

In my case, the spider handles about 1,500k documents per day; the maximum crawling speed is about 1.2k pages/minute and the average is 1k/minute. Memory cost is about 200 MB per spider kernel, and accuracy is about 90%; the remaining 10% can be fixed by customizing Score Rules or Selectors. It's better than any other readability module.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy, preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. It provides a dashboard for all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

dhtspider - Bittorrent dht network spider

  •    Javascript

A BitTorrent DHT network infohash spider, built for engiy.com (a BitTorrent resource search engine).

portSpider - 🕷 A lightning fast multithreaded network scanner framework with modules.

  •    Python

I'm not responsible for anything you do with this program, so please only use it for good and educational purposes. Copyright (c) 2017 by David Schütz. Some rights reserved.