
newspaper - 💡 News, full-text, and article metadata extraction in Python 3

  •    Python

Newspaper has seamless language extraction and detection. If no language is specified, Newspaper will attempt to auto-detect one. Check out the documentation for full and detailed guides on using newspaper.
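
A minimal sketch of the Article workflow, assuming the newspaper3k API; the URL is a placeholder:

    from newspaper import Article

    # Placeholder URL; any article page works.
    url = 'https://example.com/some-article'

    # Omit the language argument to let newspaper auto-detect it.
    article = Article(url, language='en')
    article.download()   # fetch the raw HTML
    article.parse()      # extract title, authors, body text, etc.

    print(article.title)
    print(article.text)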

SwiftLinkPreview - It makes a preview from a URL, grabbing information such as the title, relevant text, and images

  •    Swift

It makes a preview from a URL, grabbing information such as the title, relevant text, and images. To use SwiftLinkPreview as a pod, add the following to your Podfile.
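
A minimal Podfile sketch; the target name is a placeholder, and you may want to pin a version after checking the project's releases:

    use_frameworks!

    target 'YourApp' do
      # 'YourApp' is a placeholder target name.
      pod 'SwiftLinkPreview'
    end

Then run pod install and import SwiftLinkPreview where you need it.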

crawler - An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.

  •    PHP

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently. Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites; Chrome and Puppeteer power this feature.
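
A minimal sketch, assuming this is spatie/crawler and following its README; MyCrawlObserver is a hypothetical observer class you would define yourself:

    <?php

    require 'vendor/autoload.php';

    use Spatie\Crawler\Crawler;

    Crawler::create()
        ->setCrawlObserver(new MyCrawlObserver()) // hypothetical observer: receives each crawled page
        ->setConcurrency(10)                      // Guzzle promises crawl this many URLs at once
        ->executeJavaScript()                     // renders pages with Chrome/Puppeteer before crawling
        ->startCrawling('https://example.com');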

annie - 👾 Fast, simple and clean video downloader

  •    Go

👾 Annie is a fast, simple, and clean video downloader built with Go. Certain features require external dependencies (such as FFmpeg), which must be installed separately.

headless-chrome-crawler - Distributed crawler powered by Headless Chrome

  •    JavaScript

Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when websites are built on modern frontend frameworks such as AngularJS, React, and Vue.js. Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
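
A condensed sketch of the crawler's launch API, loosely following the project's README; the URL and the extracted field are placeholders:

    const HCCrawler = require('headless-chrome-crawler');

    HCCrawler.launch({
      // Runs in the page context of headless Chrome, so JS-rendered content is visible.
      evaluatePage: () => ({ title: document.title }),
      // Called with the evaluated result for each successfully crawled page.
      onSuccess: result => console.log(result.result.title),
    }).then(crawler => {
      crawler.queue('https://example.com/');
      return crawler.onIdle().then(() => crawler.close());
    });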

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for tasks such as UI testing, machine learning, and analytics. Having its own declarative language, ferret abstracts away the technical details and complexity of the underlying technologies, helping you focus on the data itself. It's extremely portable, extensible, and fast. The following example demonstrates the use of dynamic pages. First, we load the main Google Search page, type the search criteria into an input box, and then click the search button. The click action triggers a redirect, so we wait until it completes. Once the page has loaded, we iterate over all elements in the search results and assign the output to a variable. The final FOR loop filters out empty elements that may appear because of imprecise selectors.
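
A sketch of that example in ferret's FQL, loosely reconstructed from the steps above; the selectors and the boolean "dynamic page" argument are assumptions based on older README examples:

    LET google = DOCUMENT("https://www.google.com/", true)

    INPUT(google, 'input[name="q"]', "ferret")
    CLICK(google, 'input[name="btnK"]')

    WAIT_NAVIGATION(google)

    LET result = (
        FOR el IN ELEMENTS(google, '.g')
            RETURN TRIM(el.innerText)
    )

    FOR text IN result
        FILTER text != ""
        RETURN text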


scylla - Intelligent proxy pool for Humans™

  •    Python

For those who prefer to use Chinese, please read the Chinese documentation (中文文档). The example below assumes the service is running locally (localhost) on port 8899.
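
A minimal Python sketch; the /api/v1/proxies endpoint and the response fields follow the project's README, but treat them as assumptions and adjust to your deployment:

    import requests

    # Ask a locally running Scylla instance for its current proxy list.
    resp = requests.get('http://localhost:8899/api/v1/proxies')
    resp.raise_for_status()

    for proxy in resp.json().get('proxies', []):
        print(proxy.get('ip'), proxy.get('port'))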

RED_HAWK - All-in-one tool for Information Gathering, Vulnerability Scanning and Crawling

  •    PHP

RED HAWK's CMS Detector is currently able to detect the following CMSs (Content Management Systems); if the website is using some other CMS, the detector will return "could not detect". Want to contribute to RED HAWK or point out something wrong? Just create a new issue here: https://github.com/Tuhinshubhra/RED_HAWK/issues/new. I'd love to hear from you.

dirhunt - Find web directories without brute force

  •    Python

DEVELOPMENT BRANCH: The current branch is a development version. Go to the stable release by clicking on the master branch. Dirhunt is a web crawler optimized for searching and analyzing directories. This tool can find interesting things if the server has the "index of" mode enabled, and it is also useful when directory listing is disabled. It detects directories with false 404 errors, directories where an empty index file has been created to hide things, and much more.
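
A typical invocation, assuming installation from PyPI as described in the project's README; the URL is a placeholder:

    pip3 install dirhunt
    dirhunt http://example.com/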

Crawler-Detect - 🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

  •    PHP

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It is currently able to detect thousands of bots/spiders/crawlers. Run composer require jaybizzle/crawler-detect 1.* or add "jaybizzle/crawler-detect": "1.*" to your composer.json.
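
A minimal usage sketch following the project's README:

    <?php

    require 'vendor/autoload.php';

    use Jaybizzle\CrawlerDetect\CrawlerDetect;

    $CrawlerDetect = new CrawlerDetect;

    // With no argument, the current visitor's headers/user agent are inspected.
    if ($CrawlerDetect->isCrawler()) {
        // Optionally see which bot matched.
        echo $CrawlerDetect->getMatches();
    }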

Lulu - [Unmaintained] A simple and clean video/music/image downloader 👾

  •    Python

Sorry about that. Lulu is a friendly fork of you-get (⏬ a dumb downloader that scrapes the web).

huntsman - Super configurable async web spider

  •    JavaScript

Huntsman takes one or more 'seed' URLs via the spider.queue.add() method. Once the process is kicked off with spider.start(), it takes care of extracting links from each page and following only the pages you want.
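
A sketch loosely based on the project's README; the extensions shown, the URL pattern, and the selector are illustrative assumptions:

    var huntsman = require('huntsman');
    var spider = huntsman.spider();

    spider.extensions = [
      huntsman.extension('recurse'),  // follow links found on each page
      huntsman.extension('cheerio')   // expose a cheerio handle for parsing
    ];

    // Only pages whose URL matches this pattern reach the handler.
    spider.on('/articles/[a-z0-9-]+$', function (err, res) {
      if (err) return;
      var $ = res.extract.cheerio;
      console.log($('h1').text());
    });

    spider.queue.add('http://example.com/');
    spider.start();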

Soup - Web Scraper in Go, similar to BeautifulSoup

  •    Go

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.
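
A minimal sketch; the import path is the one commonly used for this project, and the URL is a placeholder:

    package main

    import (
        "fmt"

        "github.com/anaskhan96/soup"
    )

    func main() {
        // Fetch and parse a page, BeautifulSoup-style.
        html, err := soup.Get("https://example.com")
        if err != nil {
            panic(err)
        }
        doc := soup.HTMLParse(html)

        // List every link's text and href attribute.
        for _, link := range doc.FindAll("a") {
            fmt.Println(link.Text(), "->", link.Attrs()["href"])
        }
    }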

node-Tor - JavaScript implementation of the Tor (or Tor-like) anonymizer project (The Onion Router)

  •    JavaScript

For a quick look, see the demo video on Peersm, which lets you download and stream anonymously inside your browser over a serverless, anonymous P2P network compatible with torrents. Check out torrent-live for a more general presentation and to get the dynamic blocklist.

Revenant - A high-level PhantomJS headless browser in Node.js, ideal for task automation

  •    JavaScript

A headless browser powered by PhantomJS functions in Node.js, based on the PhantomJS-Node bridge. This library aims to abstract many of the simple functions one would use while testing or scraping a web page. Instead of running page.evaluate(...) and writing the JavaScript functions for each task yourself, these tasks are abstracted for the user.

goredis-crawler - Cross-platform persistent and distributed web crawler 🐜 💻

  •    Go

A cross-platform persistent and distributed web crawler. goredis-crawler is persistent because the queue is stored in a remote database that is automatically re-initialized if interrupted. goredis-crawler is distributed because multiple instances of goredis-crawler will work on the remotely stored queue, so you can start as many crawlers as you want on separate machines to speed the process along. goredis-crawler is also fast because it is threaded and uses connection pools.

linkcrawler - Cross-platform persistent and distributed web crawler 🔗

  •    Go

linkcrawler is persistent because the queue is stored in a remote database that is automatically re-initialized if interrupted. linkcrawler is distributed because multiple instances of linkcrawler will work on the remotely stored queue, so you can start as many crawlers as you want on separate machines to speed the process along. linkcrawler is also fast because it is threaded and uses connection pools. Crawl responsibly.