WebCrawler and Entity Extraction using Fetch and process frame work

  •        0

Web Crawler using Fetch And Process Framework. Yes , it does processing of robots.txt

http://crawlandextract.codeplex.com/

Tags
Implementation
License
Platform

   

comments powered by Disqus


Related Projects

HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer


HAProxy is a fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting tens of thousands of connections is clearly realistic with todays hardware.

Arachnode.net


An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

Grub


Grub Next Generation is distributed web crawling system (clients/servers) which helps to build and maintain index of the Web. It is client-server architecture where client crawls the web and updates the server. The peer-to-peer grubclient software crawls during computer idle time.

Crawltrack - Tracks the visits of Crawler


CrawlTrack is a web analytics tool. It could track the visitors, referrer, keyword used, country of origin, page views etc. The main advantage of this tool is it could track and record the information of crawler activity. It also blocks the attacks done by hackers.

TextMine


TextMine is for the Perl hacker who is grappling with the problems of managing unstructured text from various sources. You can use these text mining tools to search the Web, index text, extract entities, categorize your e-mail, and summarize documents.

Yioop - Open Source Search Engine Software


Yioop is an open source, PHP search engine capable of crawling, index, and providing search results for hundred of millions of pages on relatively low end hardware. It can index a variety of text formats HTML, RSS, PDF, RTF, DOC and images GIF, JPEG, PNG, etc. It can import data from ARC, WARC, Media-Wiki, Open Directory RDF. It is easily localized to many languages. It has built-in support for new feeds, discussion groups, blogs, and wikis. It also supports mixing indexes to create mash ups.

YaCy - Decentralized Web Search


YaCy (read "ya see") is a free distributed search engine, built on principles of peer-to-peer (P2P) networks. It is distributed on several hundred computers so-called YaCy-peers. Each YaCy-peer independently crawls through the Internet, analyzes and indexes found web pages, and stores indexing results in a common database which is shared with other YaCy-peers using principles of P2P networks.

SQLSentinel


OpenSource tool for sql injection security testing

Aperture - Java framework for getting data and metadata


Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It could crawl and extract information from File system, Websites, Mail boxes and Mail servers. It supports various file formats like Office, PDF, Zip and lot more. Metadata information is extracted from image files. Aperture has a strong focus on semantics, metadata extracted could be mapped to predefined properties.

LinkChecker


Linkchecker features: - recursive and multithreaded checking and site crawling - output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats - HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links support - restrict link checking with regular expression filters for URLs - proxy support - username/password authorization for HTTP, FTP and Telnet