spyck - Extensible framework for data mining


An extensible framework for data mining. spyck is a framework that aims to make it easy to develop crawlers and to integrate the collected data, independently of its type and origin. It is easily extensible and adaptable, and it also aims to be easy to use, even for beginners.

http://zetaresearch.github.io/projects/spyck
https://github.com/macabeus/spyck

Related Projects

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It can currently detect thousands of bots/spiders/crawlers. Run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect": "1.*" to your composer.json.
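
Once installed, typical usage looks roughly like this (a minimal sketch based on the library's documented API; check the README for current details):

```php
<?php
require 'vendor/autoload.php';

use Jaybizzle\CrawlerDetect\CrawlerDetect;

$detector = new CrawlerDetect;

// Checks the current visitor's user agent (and HTTP_FROM header) against the bot list
if ($detector->isCrawler()) {
    // e.g. skip analytics tracking for bots
    echo 'Crawler detected: ' . $detector->getMatches();
}
```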

crawlers - Some crawlers you may know :-)

  •    Python

Some crawlers you may know :-)

Storm Crawler - Web crawler SDK based on Apache Storm

  •    Java

StormCrawler is an open source collection of resources for building low-latency, scalable web crawlers on Apache Storm. StormCrawler is a library and collection of resources that developers can leverage to build their own crawlers. The good news is that doing so can be pretty straightforward. Often, all you'll have to do is declare StormCrawler as a Maven dependency, write your own Topology class (tip: you can extend ConfigurableTopology), reuse the components provided by the project, and maybe write a couple of custom ones for your own secret sauce.
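
As a rough illustration of that workflow, here is a minimal sketch of a custom topology extending ConfigurableTopology. Package and class names are from memory and may differ between StormCrawler versions; treat the spout/bolt wiring as a placeholder:

```java
import org.apache.storm.topology.TopologyBuilder;
import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Seed URLs held in memory; a real crawl would use a persistent spout
        builder.setSpout("spout", new MemorySpout(new String[]{"http://example.com/"}));

        // Fetch the pages emitted by the spout
        builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("spout");

        return submit("crawl", conf, builder);
    }
}
```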

Silverlight SEO Project

  •    Silverlight

The Silverlight SEO Project is designed to simplify configuring sites that host Silverlight Navigation applications for search engine optimization, by providing HTML content to search engine crawlers and Silverlight to users with the plug-in installed.

Mvc Xml Sitemap

  •    

MVC Sitemap makes it a snap for your ASP.NET MVC-based web site to expose a sitemap XML file to search engine crawlers. Simply place a [Sitemap] attribute on all the Actions you want crawled and create an action for the sitemap - it's that easy.
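
A hypothetical illustration of that pattern in an ASP.NET MVC controller (only the [Sitemap] attribute name comes from the description above; the stub attribute, controller, and action are made up for illustration):

```csharp
using System;
using System.Web.Mvc;

// Stand-in for the project's attribute; the real one is provided by MVC Sitemap
[AttributeUsage(AttributeTargets.Method)]
public class SitemapAttribute : Attribute { }

public class ProductsController : Controller
{
    // Marked for inclusion in the generated sitemap XML
    [Sitemap]
    public ActionResult Index()
    {
        return View();
    }
}
```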


Open Search Server

  •    C++

Open Search Server is both a modern crawler and search engine and a suite of high-powered full-text search algorithms. Built using the best open-source technologies like Lucene, ZK, Tomcat, POI, and TagSoup, Open Search Server is a stable, high-performance piece of software.

nginx-badbot-blocker - Block bad, possibly even malicious web crawlers (automated bots) using Nginx

  •    Shell

223 (and growing) Nginx rules to block bad bots. If you have a bizarre or complicated setup, be sure to look everything over before installing. However, for anyone with a fairly straightforward Nginx installation, this should work without any issues.

Spido

  •    Java

Spido is a Java-based web mining platform providing a full set of web crawlers and management features.

search engine optimization - cms

  •    

SEO-CMS (Search Engine Optimized Content Management System) helps a site be easily crawled by search engines. It also provides easy link management, theme support, crawler tracking, and stats tracking for referrers, keywords, Overture & AdWords.

Honeypot analysis 2005

  •    

The project's target is to uncover the attacks of spammers and make their actions visible: catch their email bloodhounds (spiders, crawlers) and redirect them to their own homes to obtain the spammers' IP addresses. The project is based on the honeypot principle.

Noti

  •    

Noti is a news publishing framework. It consists of several crawlers that fetch news web pages, index their content, and publish it on a centralized web site, providing user-customized feeds and daily emails of the latest news.

Music Ontology tools

  •    Python

Some tools related to the Music Ontology - including domain-specific Semantic Web crawlers, audio collection management and mapping tools

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    Javascript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React, and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
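
A minimal usage sketch following the project's README-style API (verify against the current docs):

```js
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    // Runs in the page context after rendering, so JS-generated content is visible
    evaluatePage: () => ({ title: document.title }),
    // Called with the result of evaluatePage for each crawled page
    onSuccess: result => console.log(result.result.title),
  });
  await crawler.queue('https://example.com/');
  await crawler.onIdle(); // wait until the queue is empty
  await crawler.close();
})();
```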

ProgressKit - Progress Views for Cocoa

  •    Swift

ProgressKit has a set of cool IBDesignable progress views with huge customisation options. You can now make spinners, progress bars, crawlers, etc., which can be finely customised according to your app's palette. CocoaPods adds support for Swift and embedded frameworks.

rendora - dynamic server-side rendering using headless Chrome to effortlessly solve the SEO problem for modern javascript websites

  •    Go

Rendora can be seen as a reverse HTTP proxy server sitting between your backend server (e.g. Node.js/Express.js, Python/Django, etc.) and potentially your frontend proxy server (e.g. nginx, traefik, apache, etc.) or even directly the outside world. It does nothing but transport requests and responses as they are, except when it detects whitelisted requests according to the config. In that case, Rendora instructs a headless Chrome instance to request and render the corresponding page and then returns the server-side rendered page to the client (i.e. the frontend proxy server or the outside world). This simple functionality makes Rendora a powerful dynamic renderer without changing anything in either the frontend or the backend code. Dynamic rendering means that the server provides server-side rendered HTML to web crawlers such as GoogleBot and BingBot, while providing the typical initial HTML to normal users, to be rendered on the client side. Dynamic rendering is meant to improve SEO for websites written in modern JavaScript frameworks like React, Vue, Angular, etc.

LambdaHack - Haskell game engine library for roguelike dungeon crawlers; try out the browser version at

  •    Haskell

As an example of the engine's capabilities, here is a showcase of shooting down explosive projectiles. A couple were shot down close enough to enemies to harm them. Others exploded closer to our party members and intercepted, in mid-air, the projectiles that would otherwise have harmed them. This was a semi-automatic stealthy speedrun of the escape scenario of the sample game that comes with the engine, shown in a small fixed font. The enemy gang has a huge numerical and equipment superiority. Our team loots the area on auto-pilot until the first foe is spotted, then scouts out enemy positions. Hero 1 draws the enemies, and unfortunately enemy fire as well, which is when he valiantly shoots down explosives to avoid the worst damage. Heroine 2 then sneaks behind enemy lines to reach the remaining treasure. That accomplished, the captain signals retreat and leaves for the next area (the zoo).

frontera - A scalable frontier for web crawlers

  •    Python

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers. Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises the links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner.
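
For example, when used with Scrapy, Frontera plugs in as a scheduler through the project settings. This is a sketch based on Frontera's documented Scrapy integration; module paths may vary between versions:

```python
# settings.py of a Scrapy project delegating frontier logic to Frontera
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}

# Points at a module with Frontera-specific settings (backend, max depth, etc.)
FRONTERA_SETTINGS = 'myproject.frontera.settings'
```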

prerender-node - Express middleware for prerendering javascript-rendered pages on the fly for SEO

  •    Javascript

Google, Facebook, Twitter, Yahoo, and Bing are constantly trying to view your website... but they don't execute JavaScript. That's why we built Prerender. Prerender is perfect for AngularJS SEO, BackboneJS SEO, EmberJS SEO, and any other JavaScript framework. This middleware intercepts requests to your Node.js website from crawlers, and then makes a call to the (external) Prerender Service to get the static HTML instead of the JavaScript for that page.
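
Hooking it into an Express app is essentially a one-liner (a sketch following the project's README; the token is only needed when using the hosted Prerender service):

```js
const express = require('express');
const prerender = require('prerender-node');

const app = express();

// Intercepts requests whose user agent matches a known crawler and
// fetches prerendered HTML from the Prerender service instead
app.use(prerender.set('prerenderToken', process.env.PRERENDER_TOKEN));

app.listen(3000);
```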

prerender_rails - Rails middleware gem for prerendering javascript-rendered pages on the fly for SEO

  •    Ruby

Google, Facebook, Twitter, Yahoo, and Bing are constantly trying to view your website... but they don't execute JavaScript. That's why we built Prerender. Prerender is perfect for AngularJS SEO, BackboneJS SEO, EmberJS SEO, and any other JavaScript framework. This middleware intercepts requests to your Rails website from crawlers, and then makes a call to the (external) Prerender Service to get the static HTML instead of the JavaScript for that page.

robotstxt - The repository contains Google's robots.txt parser and matcher as a C++ library

  •    C++

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant to C++11). The Robots Exclusion Protocol (REP) is a standard that enables website owners to control which URLs may be accessed by automated clients (i.e. crawlers) through a simple text file with a specific syntax. It's one of the basic building blocks of the internet as we know it and what allows search engines to operate.
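
Checking a URL against a robots.txt file with the library looks roughly like this (a sketch based on the repository's README; see robots.h for the exact signatures):

```cpp
#include <iostream>
#include <string>

#include "robots.h"  // from google/robotstxt

int main() {
  const std::string robots_txt =
      "user-agent: FooBot\n"
      "disallow: /private/\n";
  const std::string url = "https://example.com/private/page.html";

  googlebot::RobotsMatcher matcher;
  // True if the given user agent may fetch the URL under this robots.txt
  bool allowed = matcher.OneAgentAllowedByRobots(robots_txt, "FooBot", url);

  std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
  return 0;
}
```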