Displaying 1 to 20 from 79 results

webmagic - A scalable web crawler framework for Java.

  •    Java

A crawler framework. It covers the whole lifecycle of crawler: downloading, url management, content extraction and persistent. It can simply the development of a specific crawler.

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repositoriy of your choice (e.g. a search engine). It very flexible, powerful, easy to extend, and portable.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.




Aperture - Java framework for getting data and metadata

  •    Java

Aperture is a Java framework for extracting and querying full-text content and metadata from various information systems. It could crawl and extract information from File system, Websites, Mail boxes and Mail servers. It supports various file formats like Office, PDF, Zip and lot more. Metadata information is extracted from image files. Aperture has a strong focus on semantics, metadata extracted could be mapped to predefined properties.

Grub

  •    CSharp

Grub Next Generation is distributed web crawling system (clients/servers) which helps to build and maintain index of the Web. It is client-server architecture where client crawls the web and updates the server. The peer-to-peer grubclient software crawls during computer idle time.

Open Search Server

  •    C++

Open Search Server is both a modern crawler and search engine and a suite of high-powered full text search algorithms. Built using the best open source technologies like lucene, zkoss, tomcat, poi, tagsoup. Open Search Server is a stable, high-performance piece of software.

AppCrawler - 基于appium的app自动遍历工具

  •    Scala

一个基于自动遍历的app爬虫工具. 支持android和iOS, 支持真机和模拟器. 最大的特点是灵活性. 可通过配置来设定遍历的规则.


fscrawler - Elasticsearch File System Crawler (FS Crawler)

  •    Java

FS Crawler offers a simple way to index binary files into elasticsearch.

Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.

Norconex HTTP Collector - A Web Crawler in Java

  •    Java

Norconex HTTP Collector is a web spider, or crawler that aims to make Enterprise Search integrators and developers's life easier. It is Portable, Extensible, reusable, Robots.txt support, Obtain and manipulate document metadata, Resumable upon failure and lot more.

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. StormCrawler is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward. Often, all you'll have to do will be to declare StormCrawler as a Maven dependency, write your own Topology class (tip : you can extend ConfigurableTopology), reuse the components provided by the project and maybe write a couple of custom ones for your own secret sauce.

Revenant - A high level PhantomJS headless browser in Node.js ideal for task automation

  •    Javascript

A headless browser powered by PhantomJS functions in Node.js. Based on the PhantomJS-Node bridge.This library aims to abstract many of the simple functions one would use while testing or scraping a web page. Instead of running page.evaluate(...) and entering the javascript functions for a task, these tasks are abstracted for the user.

dhtspider - Bittorrent dht network spider

  •    Javascript

Bittorrent dht network infohash spider, for engiy.com[a bittorrent resource search engine]

lightcrawler - Crawl a website and run it through Google lighthouse

  •    Javascript

Crawl a website and run it through Google lighthouse

NewPipeExtractor - Core part of NewPipe

  •    Java

NewPipe Extractor is a library for extracting things from streaming sites. It is a core component of NewPipe, but could be used independently.NewPipe Extractor is available at JitPack's Maven repo.

ghcrawler - Crawl GitHub APIs and store the discovered orgs, repos, commits, ...

  •    Javascript

A robust GitHub API crawler that walks a queue of GitHub entities retrieving and storing their contents.

bjut_crawler - BJUT secrets~

  •    Javascript

crawl student information from bjut





We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.