krawler - A web crawling framework written in Kotlin

Krawler is a web crawling framework written in Kotlin. It is heavily inspired by crawler4j by Yasser Ganjisaffar. The project is still very new, and those looking for a mature, well-tested crawler framework should likely still use crawler4j. For those who can tolerate a bit of turbulence, Krawler should serve as a replacement for crawler4j with minimal modifications to existing applications.

Using the Krawler framework is fairly simple. At a minimum, two methods must be overridden in order to use the framework: the shouldVisit method dictates what should be visited by the crawler, and the visit method dictates what happens once a page is visited. Overriding these two methods is sufficient for creating your own crawler; however, there are additional methods that can be overridden to provide more robust behavior.
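
The two-method pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Krawler's actual API: the SketchCrawler base class, its method signatures, the SameHostCrawler subclass and the in-memory link graph are all hypothetical stand-ins, chosen so the example runs without any network access or dependencies. Consult the project repository for the real class and parameter names.

```kotlin
// Hypothetical stand-in for a crawler base class; NOT Krawler's real API.
// "pages" is an in-memory link graph (URL -> outgoing links) used instead of HTTP.
abstract class SketchCrawler(private val pages: Map<String, List<String>>) {

    // Decides whether a discovered URL should be crawled at all.
    abstract fun shouldVisit(url: String): Boolean

    // Called once for every known page that passed shouldVisit.
    abstract fun visit(url: String, links: List<String>)

    // Breadth-first crawl over the link graph, starting at the seed URL.
    fun start(seed: String) {
        val seen = mutableSetOf(seed)
        val queue = ArrayDeque(listOf(seed))
        while (queue.isNotEmpty()) {
            val url = queue.removeFirst()
            if (!shouldVisit(url)) continue
            val links = pages[url] ?: continue
            visit(url, links)
            // Enqueue each outgoing link the first time it is seen.
            for (link in links) if (seen.add(link)) queue.addLast(link)
        }
    }
}

// A crawler that stays on one host and records what it visited.
class SameHostCrawler(pages: Map<String, List<String>>) : SketchCrawler(pages) {
    val visited = mutableListOf<String>()

    override fun shouldVisit(url: String): Boolean =
        url.startsWith("http://example.com/")

    override fun visit(url: String, links: List<String>) {
        visited += url
    }
}

fun main() {
    val pages = mapOf(
        "http://example.com/" to listOf("http://example.com/a", "http://other.org/"),
        "http://example.com/a" to listOf("http://example.com/"),
        "http://other.org/" to emptyList<String>()
    )
    val crawler = SameHostCrawler(pages)
    crawler.start("http://example.com/")
    println(crawler.visited) // prints [http://example.com/, http://example.com/a]
}
```

In this sketch the base class owns the crawl loop and the subclass only supplies policy (shouldVisit) and page handling (visit), which is the division of labor the description above implies.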

https://github.com/brianmadden/krawler

Related Projects

Arachnode.net

  •    CSharp

An open source .NET web crawler written in C# using SQL 2005/2008. Arachnode.net is a complete and comprehensive .NET web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages.

Nutch - Highly extensible, highly scalable Web crawler

  •    Java

Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.

Scrapy - Web crawling & scraping framework for Python

  •    Python

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Grub

  •    CSharp

Grub Next Generation is a distributed web crawling system (clients/servers) which helps to build and maintain an index of the Web. It uses a client-server architecture in which the client crawls the web and updates the server. The peer-to-peer grubclient software crawls during computer idle time.

gain - Web crawling framework based on asyncio.

  •    Python

Web crawling framework for everyone, written with asyncio, uvloop and aiohttp. Proxy settings can be added to a spider.


Open Search Server

  •    C++

Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. It is built using the best open source technologies, such as Lucene, zkoss, Tomcat, POI and TagSoup. Open Search Server is a stable, high-performance piece of software.

frontera - A scalable frontier for web crawlers

  •    Python

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler. Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it is capable of doing so in a distributed manner.

crawler4j - Open Source Web Crawler for Java

  •    Java

Open Source Web Crawler for Java

grab - Web Scraping Framework

  •    Python

Project Grab is not abandoned, but it is not being actively developed. At present I am working on another crawling framework, which I want to be simple, fast, and free of memory leaks. The new project lives on GitHub (https://github.com/lorien/crawler). At first I tried a mix of asyncio (network) and classic threads (parsing HTML with lxml on multiple CPU cores), but then I decided to use classic threads for everything for the sake of simplicity. Network requests are processed with pycurl because it is fast, feature-rich and supports socks5 proxies. You can try the new framework, but be aware that it does not have many features yet. In particular, its options for configuring network requests are very poor. If you need some option, feel free to create a new issue.

Norconex HTTP Collector - Enterprise Web Crawler

  •    Java

Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data in a repository of your choice (e.g. a search engine). It is very flexible, powerful, easy to extend, and portable.

Spark - A simple expressive web framework for java

  •    Java

Spark is a micro framework for creating web applications in Kotlin and Java 8 with minimal effort. It is a simple and expressive Java/Kotlin web framework DSL built for rapid development. Spark's intention is to provide an alternative for Kotlin/Java developers who want to make their web applications as expressive as possible with minimal boilerplate. With a clear philosophy, Spark is designed not only to make you more productive, but also to make your code better under the influence of Spark’s sleek, declarative and expressive syntax.

Heritrix

  •    Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.

WebCrawler and Entity Extraction using Fetch and Process framework

Web crawler using the Fetch and Process framework. Yes, it processes robots.txt.

Ktor - Framework for quickly creating connected applications in Kotlin with minimal effort

  •    Kotlin

Ktor is a framework for quickly creating web applications in Kotlin with minimal effort. The Ktor framework doesn't impose a lot of constraints on what technology a project is going to use for logging, templating, messaging, persistence, serialization, dependency injection, etc. Sometimes it may be required to implement a simple interface, but usually it is a matter of writing a transforming or intercepting function. Features are installed into the application using a unified interception mechanism which allows building arbitrary pipelines.

grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

  •    Python

grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. Internally, grab-site uses wpull for crawling. It also provides a dashboard with all of your crawls, showing which URLs are being grabbed, how many URLs are left in the queue, and more.

Javalin - A Simple REST API Library for Java / Kotlin

  •    Java

Javalin is a very lightweight web framework for Kotlin and Java, inspired by Sparkjava and koa.js. Javalin is written in Kotlin with a few functional interfaces written in Java. This was necessary to provide an enjoyable and near identical experience for both Kotlin and Java developers.

SharePoint Link Checker

SharePoint Link Checker can be used by administrators to schedule scans of site collections and report on broken links that are found in publishing content, link fields, rich text fields, summary link fields/web parts and content editor web parts.

kara - Kotlin Web Framework for the JVM

  •    Kotlin

Kara is a web framework for the JVM written in Kotlin. It enables developers to build succinct, type-safe HTML and CSS all in one language. The article Type-safe Web with Kotlin by Andrey Breslav illustrates the benefits of such a framework.

Gigablast - Web and Enterprise search engine in C++

  •    C++

Gigablast is one of the remaining four search engines in the United States that maintains its own searchable index of over a billion pages. It is scalable to thousands of servers and has scaled to over 12 billion web pages on over 200 servers. It supports distributed web crawling, document conversion, automated data corruption detection and repair, clustering of results from the same site, synonym search, spell checking and a lot more.

Kanary - A minimalist web framework for building REST APIs in Kotlin/Java.

  •    Kotlin

A light weight🏋🏿 Kotlin web framework for building🔩⚙ highly scalable📈 web APIs