
webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework that covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It simplifies the development of a specific crawler.
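
As a rough sketch of that lifecycle, assuming webmagic's usual PageProcessor/Spider API (class names per the us.codecraft.webmagic packages), a minimal crawler might look like this:

    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.processor.PageProcessor;

    // Minimal sketch: the PageProcessor handles content extraction and URL
    // management, while the Spider drives downloading and result persistence.
    public class ExampleProcessor implements PageProcessor {

        private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

        @Override
        public void process(Page page) {
            // Content extraction: grab the page title with an XPath expression.
            page.putField("title", page.getHtml().xpath("//title/text()").toString());
            // URL management: queue further links discovered on the page.
            page.addTargetRequests(page.getHtml().links().all());
        }

        @Override
        public Site getSite() {
            return site;
        }

        public static void main(String[] args) {
            // https://example.com is a placeholder start URL.
            Spider.create(new ExampleProcessor())
                  .addUrl("https://example.com")
                  .thread(2)
                  .run();
        }
    }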

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specific features such as a crawler, a link-graph database, and parsers for HTML and other document formats.

Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more, and metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.




fscrawler - Elasticsearch File System Crawler (FS Crawler)

  •    Java

FS Crawler offers a simple way to index binary files into Elasticsearch.

YaCy - Decentralized Web Search

  •    Java

YaCy (read "ya see") is a free distributed search engine, built on principles of peer-to-peer (P2P) networks. It runs on several hundred computers, so-called YaCy peers. Each YaCy peer independently crawls the Internet, analyzes and indexes the web pages it finds, and stores the indexing results in a common database that is shared with other YaCy peers using P2P principles.

Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make Enterprise Search integrators' and developers' lives easier. It is portable, extensible, and reusable; supports robots.txt; can obtain and manipulate document metadata; is resumable upon failure; and much more.


Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. It is a library that developers can leverage to build their own crawlers, and doing so can be pretty straightforward: often, all you have to do is declare StormCrawler as a Maven dependency, write your own topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce, as sketched below.
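
A minimal sketch of such a topology class, assuming StormCrawler's ConfigurableTopology base class and its bundled MemorySpout, FetcherBolt, JSoupParserBolt, and StdOutIndexer components (package names vary between the older com.digitalpebble.stormcrawler and newer Apache releases, so treat them as assumptions):

    import org.apache.storm.topology.TopologyBuilder;

    import com.digitalpebble.stormcrawler.ConfigurableTopology;
    import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
    import com.digitalpebble.stormcrawler.spout.MemorySpout;

    // Sketch of a custom topology extending ConfigurableTopology, wiring a few
    // of the components shipped with the project. A production topology would
    // also add URL partitioning, URL filtering, and a status updater.
    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // Seed URLs kept in memory; https://example.com/ is a placeholder.
            builder.setSpout("spout", new MemorySpout("https://example.com/"));

            builder.setBolt("fetch", new FetcherBolt())
                   .shuffleGrouping("spout");

            builder.setBolt("parse", new JSoupParserBolt())
                   .localOrShuffleGrouping("fetch");

            builder.setBolt("index", new StdOutIndexer())
                   .localOrShuffleGrouping("parse");

            // "conf" is inherited from ConfigurableTopology.
            return submit("crawl", conf, builder);
        }
    }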

NewPipeExtractor - Core part of NewPipe

  •    Java

NewPipe Extractor is a library for extracting things from streaming sites. It is a core component of NewPipe, but can be used independently. NewPipe Extractor is available from JitPack's Maven repository.

fess-crawler - Web/FileSystem Crawler Library

  •    Java

Fess Crawler is a crawler framework for web sites and file systems.

prerender-java - Java framework for Prerender

  •    Java

Use this Java filter to prerender a JavaScript-rendered page using an external service and return the resulting HTML to search engine crawlers for SEO. Note: make sure you have more than one web server thread/process running, because the prerender service will make a request back to your server to render the HTML.
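
A hedged sketch of registering the filter in a Spring Boot application (the PreRenderSEOFilter class name and the prerenderToken parameter reflect the project's typical setup and should be verified against its README; plain web.xml registration is also possible):

    import org.springframework.boot.web.servlet.FilterRegistrationBean;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    import com.github.greengerong.PreRenderSEOFilter;

    // Registers the prerender servlet filter for all requests so that crawler
    // requests are served prerendered HTML fetched from the external service.
    @Configuration
    public class PrerenderConfig {

        @Bean
        public FilterRegistrationBean<PreRenderSEOFilter> prerenderFilter() {
            FilterRegistrationBean<PreRenderSEOFilter> registration =
                    new FilterRegistrationBean<>(new PreRenderSEOFilter());
            // "YOUR_TOKEN" is a placeholder for the prerender service token.
            registration.addInitParameter("prerenderToken", "YOUR_TOKEN");
            registration.addUrlPatterns("/*");
            return registration;
        }
    }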

robots

  •    Java

A distributed robots.txt parser and rule checker accessed through an API. If you are working on a distributed web crawler and want to be polite in your actions, you will find this project very useful. It can also be integrated into any SEO tool to check whether content is correctly indexed by robots.




