
webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction, and persistence. It can simplify the development of a specific crawler.

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It can crawl and extract information from file systems, websites, mailboxes, and mail servers. It supports various file formats such as Office, PDF, Zip, and many more. Metadata is also extracted from image files. Aperture has a strong focus on semantics: extracted metadata can be mapped to predefined properties.




fscrawler - Elasticsearch File System Crawler (FS Crawler)

  •    Java

FS Crawler offers a simple way to index binary files into Elasticsearch.

Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.
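
The "measured, adaptive pace" idea can be illustrated with a small sketch: wait longer before the next request to a host when its last response was slow, within fixed bounds. This is not Heritrix code; the class name `PolitenessDelay`, its parameters, and the values in the usage note are assumptions chosen for illustration.

```java
/** Adaptive politeness delay (illustrative sketch, not taken from Heritrix). */
final class PolitenessDelay {
    private final double delayFactor; // multiple of the last fetch duration to wait
    private final long minDelayMs;    // never wait less than this
    private final long maxDelayMs;    // never wait more than this

    PolitenessDelay(double delayFactor, long minDelayMs, long maxDelayMs) {
        if (minDelayMs > maxDelayMs) {
            throw new IllegalArgumentException("minDelayMs must not exceed maxDelayMs");
        }
        this.delayFactor = delayFactor;
        this.minDelayMs = minDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Returns how long to wait before the next request to the same host. */
    long nextDelayMs(long lastFetchDurationMs) {
        long scaled = Math.round(lastFetchDurationMs * delayFactor);
        // Clamp the scaled delay into [minDelayMs, maxDelayMs].
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }
}
```

With a factor of 5 and bounds of 3 to 30 seconds, a 2-second fetch would be followed by a 10-second pause, so a slow server is automatically contacted less often.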

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler, that aims to make life easier for Enterprise Search integrators and developers. It is portable, extensible, and reusable, supports robots.txt, can obtain and manipulate document metadata, is resumable upon failure, and much more.

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source library and collection of resources that developers can use to build their own low-latency, scalable web crawlers on Apache Storm. The good news is that doing so can be pretty straightforward. Often, all you have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce.


NewPipeExtractor - Core part of NewPipe

  •    Java

NewPipe Extractor is a library for extracting things from streaming sites. It is a core component of NewPipe, but could be used independently. NewPipe Extractor is available at JitPack's Maven repo.

fess-crawler - Web/FileSystem Crawler Library

  •    Java

Fess Crawler is a crawler framework.

prerender-java - java framework for prerender

  •    Java

Use this Java filter to prerender a JavaScript-rendered page using an external service and return the HTML to the search engine crawler for SEO. Note: Make sure you have more than one web server thread/process running, because the prerender service will make a request to your server to render the HTML.
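
A filter of this kind first has to decide whether a request comes from a crawler. The sketch below shows one simple way to make that decision; it is not prerender-java's implementation, and the class name `CrawlerDetector`, the method `shouldPrerender`, and the short token list are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Decides whether a request should be served prerendered HTML (illustrative sketch). */
final class CrawlerDetector {
    // Hypothetical, deliberately short list of crawler user-agent tokens.
    private static final List<String> BOT_TOKENS =
            Arrays.asList("googlebot", "bingbot", "yandex", "baiduspider");

    static boolean shouldPrerender(String userAgent, String queryString) {
        // Crawlers following the older AJAX crawling scheme add this parameter.
        if (queryString != null && queryString.contains("_escaped_fragment_")) {
            return true;
        }
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : BOT_TOKENS) {
            if (ua.contains(token)) {
                return true;
            }
        }
        return false;
    }
}
```

In a servlet filter, a `true` result would lead to fetching the rendered HTML from the external service, while ordinary browser requests would continue down the filter chain unchanged.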

robots

  •    Java

A distributed robots.txt parser and rule checker exposed through an API. If you are working on a distributed web crawler and want it to behave politely, you will find this project very useful. It can also be integrated into any SEO tool to check whether content is being indexed correctly by robots.
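
To make the idea of robots.txt rule checking concrete, here is a minimal single-process sketch. It does not reflect this project's API: the class `RobotsRules` and its methods are hypothetical, and the sketch assumes plain prefix matching (no `*` or `$` wildcards) with the longest matching rule winning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Minimal robots.txt rule checker (illustrative sketch; prefix matching only). */
final class RobotsRules {
    private static final class Rule {
        final String prefix;
        final boolean allow;
        Rule(String prefix, boolean allow) { this.prefix = prefix; this.allow = allow; }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Collects the rules for the given agent, falling back to the "*" group. */
    static RobotsRules parse(String robotsTxt, String agent) {
        String wanted = agent.toLowerCase(Locale.ROOT);
        RobotsRules specific = new RobotsRules();
        RobotsRules wildcard = new RobotsRules();
        boolean sawSpecific = false, inSpecific = false, inWildcard = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\R")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // A User-agent line that follows rules starts a new group.
                if (!lastWasAgent) { inSpecific = false; inWildcard = false; }
                String name = value.toLowerCase(Locale.ROOT);
                if (name.equals("*")) {
                    inWildcard = true;
                } else if (!name.isEmpty() && wanted.contains(name)) {
                    inSpecific = true;
                    sawSpecific = true;
                }
                lastWasAgent = true;
            } else if (field.equals("allow") || field.equals("disallow")) {
                lastWasAgent = false;
                if (value.isEmpty()) continue; // an empty value restricts nothing
                Rule rule = new Rule(value, field.equals("allow"));
                if (inSpecific) specific.rules.add(rule);
                if (inWildcard) wildcard.rules.add(rule);
            }
            // Other fields (Sitemap, Crawl-delay, ...) are ignored in this sketch.
        }
        return sawSpecific ? specific : wildcard;
    }

    /** Longest matching prefix wins; ties favor Allow; no match means allowed. */
    boolean isAllowed(String path) {
        int bestLen = -1;
        boolean allowed = true;
        for (Rule r : rules) {
            if (path.startsWith(r.prefix)) {
                int len = r.prefix.length();
                if (len > bestLen || (len == bestLen && r.allow)) {
                    bestLen = len;
                    allowed = r.allow;
                }
            }
        }
        return allowed;
    }
}
```

A crawler would call `RobotsRules.parse(body, "mybot")` once per host, cache the result, and consult `isAllowed(path)` before each fetch. A distributed service would additionally need shared caching and expiry, which this sketch leaves out.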