Scrapy - Web crawling & scraping framework for Python


Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
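
As an illustration of the kind of spider Scrapy is built around, here is a minimal sketch based on Scrapy's own tutorial; the demo site (quotes.toscrape.com) and the CSS selectors are illustrative:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one structured item per quote block on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination until there is no "next" link.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, this runs with "scrapy runspider quotes_spider.py -o quotes.json", which writes the extracted items to a JSON file.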

dht - BitTorrent DHT Protocol && DHT Spider.


See the video on YouTube. It contains two modes, the standard mode and the crawling mode. The standard mode follows the BEPs, and you can use it as a standard DHT server. The crawling mode aims to crawl as much metadata info as possible; it doesn't follow the standard BEP protocol. With the crawling mode, you can build another BTDigg.
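
To show the wire format the standard mode speaks, here is a minimal sketch of a KRPC ping query (BEP 5) in Python; the bootstrap host and the random node ID are illustrative assumptions, not part of this project:

    import os
    import socket

    def bencode(obj):
        # Minimal bencoder covering the types a KRPC query needs.
        if isinstance(obj, int):
            return b"i%de" % obj
        if isinstance(obj, bytes):
            return b"%d:%s" % (len(obj), obj)
        if isinstance(obj, dict):
            pairs = sorted(obj.items())
            return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
        raise TypeError("unsupported type: %r" % type(obj))

    # BEP 5 ping query: transaction id "t", message type "y", and a
    # random 160-bit node ID identifying the sender.
    ping = {b"t": b"aa", b"y": b"q", b"q": b"ping", b"a": {b"id": os.urandom(20)}}

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5)
    sock.sendto(bencode(ping), ("router.bittorrent.com", 6881))  # illustrative bootstrap node
    reply, addr = sock.recvfrom(1024)
    print(addr, reply)  # bencoded reply carrying the remote node's ID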

colly - Fast and Elegant Scraping Framework for Gophers


Colly provides a clean interface to write any kind of crawler/scraper/spider. With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.

Norconex HTTP Collector - Enterprise Web Crawler


Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Yioop - Open Source Search Engine Software


Yioop is an open source, PHP search engine capable of crawling, indexing, and providing search results for hundreds of millions of pages on relatively low-end hardware. It can index a variety of text formats (HTML, RSS, PDF, RTF, DOC) and images (GIF, JPEG, PNG, etc.). It can import data from ARC, WARC, MediaWiki, and Open Directory RDF. It is easily localized to many languages. It has built-in support for news feeds, discussion groups, blogs, and wikis. It also supports mixing indexes to create mashups.
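
As a hedged aside on the WARC archives Yioop can import (this is not Yioop's own PHP code), the third-party Python warcio library iterates such a file as follows; the archive name is hypothetical:

    from warcio.archiveiterator import ArchiveIterator

    # Walk a (possibly gzipped) WARC file and print each crawled URL.
    with open("crawl.warc.gz", "rb") as stream:  # hypothetical archive name
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))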

Gigablast - Web and Enterprise search engine in C++


Gigablast is one of the remaining four search engines in the United States that maintain their own searchable index of over a billion pages. It is scalable to thousands of servers and has scaled to over 12 billion web pages on over 200 servers. It supports a distributed web crawler, document conversion, automated data corruption detection and repair, clustering of results from the same site, synonym search, a spell checker, and a lot more.

Grub


Grub Next Generation is a distributed web crawling system (clients/servers) that helps build and maintain an index of the Web. It uses a client-server architecture in which clients crawl the web and update the server. The peer-to-peer grubclient software crawls during computer idle time.

Open Search Server


Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. It is built using the best open source technologies, such as Lucene, ZKoss, Tomcat, POI, and TagSoup. Open Search Server is a stable, high-performance piece of software.

ASPseek


ASPseek is Internet search engine software developed by SWsoft. ASPseek consists of an indexing robot, a search daemon, and a CGI search front end. It can index as many as a few million URLs and search for words and phrases, use wildcards, and do Boolean searches. Search results can be limited to a given time period, a site, or a Web space (a set of sites), and sorted by relevance (PageRank is used) or date.

Pavuk


Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

VeryCD WebSpider - A plugin for InfoVista.NET


VeryCD WebSpider is a web-spider application that fetches eMule content information from www.verycd.com. The results are stored in Access (.mdb) format. It is developed under VS2005 and also works as a content-provider plugin for InfoVista.NET.

Squzer - Distributed Web Crawler


Squzer is Declum's open-source, extensible, scalable, multithreaded, quality-focused web crawler project, written entirely in the Python language.
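
This is not Squzer's own code, but a minimal sketch of the multithreaded crawl loop such a Python crawler is built around; the seed URL and worker count are illustrative:

    import queue
    import threading
    import urllib.request

    frontier = queue.Queue()
    frontier.put("http://example.com/")  # illustrative seed URL

    def worker():
        # Each worker repeatedly pulls a URL from the shared frontier and fetches it.
        while True:
            url = frontier.get()
            try:
                body = urllib.request.urlopen(url, timeout=10).read()
                print(url, len(body), "bytes")
                # A real crawler would extract links here, deduplicate them,
                # and put unseen URLs back on the frontier.
            except OSError as exc:
                print(url, "failed:", exc)
            finally:
                frontier.task_done()

    for _ in range(4):  # illustrative worker count
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()  # returns once every queued URL has been processed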

NWebCrawler


This is a web crawler program written in C#.

Spider.NET


Spider.NET crawls websites and stores the content in an MS SQL or Access database, which can then be indexed and queried to power a website search.
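
To illustrate that store-then-query pattern (this is not Spider.NET's code, and SQLite's FTS5 full-text index stands in for MS SQL/Access), a short Python sketch:

    import sqlite3
    import urllib.request

    db = sqlite3.connect("site.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")

    # Fetch one page and add it to the full-text index; the URL is illustrative.
    url = "http://example.com/"
    body = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    db.execute("INSERT INTO pages (url, body) VALUES (?, ?)", (url, body))
    db.commit()

    # Query the index the way a site-search front end would.
    for (hit,) in db.execute("SELECT url FROM pages WHERE pages MATCH ?", ("domain",)):
        print(hit)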