Norconex HTTP Collector - A Web Crawler in Java


Norconex HTTP Collector is a web spider, or crawler, that aims to make life easier for Enterprise Search integrators and developers. It is portable, extensible, and reusable; it supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and much more.

www.norconex.com/product/collector-http/
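Every project on this page implements some variant of the same core loop: keep a frontier of URLs to visit and a set of pages already seen, fetch each page, extract its links, and enqueue the ones not yet visited. A minimal sketch of that loop in Java follows; the fetch is stubbed with canned HTML, the URLs and helper names are purely illustrative, and this is not the Norconex API (a real crawler would also issue HTTP requests and honor robots.txt):

```java
import java.util.*;
import java.util.regex.*;

// Minimal sketch of the fetch-parse-follow loop at the heart of a web crawler.
public class CrawlLoopSketch {

    // Hypothetical stub standing in for an HTTP fetch (no network access here).
    static String fetch(String url) {
        if (url.equals("http://example.com/")) {
            return "<a href=\"http://example.com/a\">A</a>"
                 + "<a href=\"http://example.com/b\">B</a>";
        }
        return ""; // leaf pages have no outgoing links in this sketch
    }

    // Naive link extraction; real crawlers use a proper HTML parser.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) links.add(m.group(1));
        return links;
    }

    // Breadth-first crawl from a seed URL, capped at maxPages pages.
    public static List<String> crawl(String seed, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new LinkedHashSet<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // skip already-seen URLs
            for (String link : extractLinks(fetch(url))) {
                if (!visited.contains(link)) frontier.add(link);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/", 10));
    }
}
```

Where the projects below differ is in everything around this loop: politeness and robots.txt handling, distribution across machines, document conversion, and where the collected data is stored.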

Related Projects

Norconex HTTP Collector - Enterprise Web Crawler


Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Open Search Server


Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. Built using proven open source technologies such as Lucene, zkoss, Tomcat, POI, and TagSoup, Open Search Server is a stable, high-performance piece of software.

Gigablast - Web and Enterprise search engine in C++


Gigablast is one of the remaining four search engines in the United States that maintains its own searchable index of over a billion pages. It is scalable to thousands of servers and has scaled to over 12 billion web pages on over 200 servers. It supports distributed web crawling, document conversion, automated data-corruption detection and repair, clustering of results from the same site, synonym search, spell checking, and much more.

Search-Engine-Web-Crawler - Search engine, web crawler, and index maker in Java.



ASPseek


ASPseek is Internet search engine software developed by SWsoft. ASPseek consists of an indexing robot, a search daemon, and a CGI search frontend. It can index as many as a few million URLs and search for words and phrases, use wildcards, and perform Boolean searches. Search results can be limited to a given time period, site, or Web space (set of sites), and sorted by relevance (PageRank is used) or date.

search-engine - A web search engine built using Lucene and a web crawler.



Ex-Crawler


Ex-Crawler is divided into three subprojects (crawler daemon, distributed GUI client, and (web) search engine), which together provide a flexible and powerful search engine supporting distributed computing. More information: http://ex-crawler.sourceforge.net

Pavuk


Pavuk is a UNIX program used to mirror the contents of WWW documents or files. It transfers documents from HTTP, FTP, Gopher and optionally from HTTPS (HTTP over SSL) servers. Pavuk has an optional GUI based on the GTK2 widget set.

Squzer - Distributed Web Crawler


Squzer is Declum's open-source, extensible, scalable, multithreaded, quality web crawler project, written entirely in Python.

crawler - I needed a serious web crawler for search engine applications. This is it.



web-crawler - A web crawler/spider using python



Arachnode.net


An open source .NET web crawler written in C# using SQL Server 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing, and storing Internet content, including e-mail addresses, files, hyperlinks, images, and Web pages.

Grub


Grub Next Generation is a distributed web crawling system (clients/servers) that helps build and maintain an index of the Web. It uses a client-server architecture in which clients crawl the web and update the server; the peer-to-peer grubclient software crawls during computer idle time.

node-search-engine - Sample search engine with web crawler, built on Node.js + CouchDB + Limestone



Node-Web-Spider - A web spider (crawler) that follows all the links within a domain name



WWW-Crawler-Lite - A single-threaded crawler/spider for the web.



Crawler - web Crawler\Spider



node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery ;-)



ragno - Web spider/crawler written in ruby ('ragno' is Italian for 'spider')



CubeRoot - Web Crawler, Indexer, and Search Engine in python

