WebExtractor360 - Open Source Web Extractor

WebExtractor360 is a free and open source web data extractor. It uses regular expressions to find, extract, and scrape internet data quickly and easily. It is very flexible, allowing you to extract both simple, commonly used data and complex structures such as HTML tables.
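
Regex-based extraction of this kind can be sketched in a few lines. The snippet below is a generic Python illustration (not WebExtractor360 itself, which is a standalone tool), assuming a table with exactly two cells per row.

```python
import re

# Sample HTML with a simple table -- the kind of structure a
# regex-based extractor targets.
html = """
<table>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

# One pattern per row, one capture group per cell.
row_pattern = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td></tr>")

rows = row_pattern.findall(html)
print(rows)  # [('Alice', '30'), ('Bob', '25')]
```

The appeal of the approach is that no HTML parser is needed; the trade-off is fragility when the markup changes.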

http://webextractor360.codeplex.com/

Related Projects

web-scraper-chrome-extension - Web data extraction tool implemented as chrome extension

  •    Javascript

Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) describing how a web site should be traversed and what should be extracted. Using these sitemaps, Web Scraper navigates the site accordingly and extracts all the data. Scraped data can later be exported as CSV. When submitting a bug, please attach an exported sitemap if possible.

scraperjs - A complete and versatile web scraper.

  •    Javascript

Scraperjs is a web scraper module that makes scraping the web an easy job.

Soup - Web Scraper in Go, similar to BeautifulSoup

  •    Go

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

market_bot - Google Play Android App store scraper

  •    Ruby

Market Bot is a web scraper (web robot, web spider) for the Google Play Android app store. It can collect data on apps, charts, and developers. Google has recently changed the HTML and CSS for the Play Store, which has caused the released version of Market Bot to break. New code is in the (unreleased) master branch to begin fixing this problem. If you are interested in helping, please join the discussion in issue 72.

python-goose - Html Content / Article Extractor, web scraping lib in Python

  •    Python

Goose was originally an article extractor written in Java that was most recently (Aug 2011) converted to a Scala project. This is a complete rewrite in Python. The aim of the software is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.
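
The main-body step can be illustrated with a toy heuristic (not Goose's actual algorithm, which also weighs link density, stopwords, and other signals): split the page into blocks and keep the one with the most text.

```python
import re

# A toy version of the main-body heuristic article extractors rely on:
# navigation and footer blocks are short, the article block is long.
html = """
<div>Menu Home About</div>
<div>This is the article body. It has several sentences and is by
far the longest block of text on the page.</div>
<div>Copyright 2024</div>
"""

blocks = re.findall(r"<div>(.*?)</div>", html, flags=re.S)

# Pick the block with the highest word count.
main_body = max(blocks, key=lambda b: len(b.split()))
print(main_body.strip()[:25])  # This is the article body.
```

Real extractors refine this with per-node scoring, but length-based block selection is the core intuition.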

x-ray - The next web scraper. See through the <html> noise.

  •    Javascript

Flexible schema: supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
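
The flexible-schema idea can be sketched in Python (x-ray itself is a Node.js library driven by CSS selectors; the regex-based `extract` helper below is purely illustrative): the caller describes the output shape, and the extractor fills it in.

```python
import re

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

def extract(html, schema):
    """Apply a nested schema of regexes to an HTML string."""
    if isinstance(schema, str):                 # single value: first match
        m = re.search(schema, html)
        return m.group(1) if m else None
    if isinstance(schema, list):                # list of values: all matches
        return re.findall(schema[0], html)
    if isinstance(schema, dict):                # nested object: recurse
        return {k: extract(html, v) for k, v in schema.items()}

data = extract(html, {
    "titles": [r"<a [^>]*>(.*?)</a>"],
    "first_link": r'href="([^"]+)"',
})
print(data)  # {'titles': ['First', 'Second'], 'first_link': '/a'}
```

The key property x-ray advertises is visible even in this sketch: the schema mirrors the output you want, not the page's markup.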

crawler - A high performance web crawler in Elixir.

  •    Elixir

A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
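
The worker-pooling-with-rate-limiting pattern it describes looks roughly like this (a Python sketch with a stubbed `fetch`; the names are illustrative, not Crawler's API): several workers pull URLs from a shared queue, but a global lock enforces a minimum interval between fetches.

```python
import queue
import threading
import time

NUM_WORKERS = 3
MIN_INTERVAL = 0.05          # at most one fetch per 50 ms overall

urls = queue.Queue()
results = []
results_lock = threading.Lock()
rate_lock = threading.Lock()
last_fetch = [0.0]

def fetch(url):
    # Placeholder for a real HTTP request.
    return f"content of {url}"

def worker():
    while True:
        url = urls.get()
        if url is None:       # poison pill: shut the worker down
            urls.task_done()
            return
        with rate_lock:       # global rate limit across all workers
            wait = MIN_INTERVAL - (time.monotonic() - last_fetch[0])
            if wait > 0:
                time.sleep(wait)
            last_fetch[0] = time.monotonic()
        body = fetch(url)
        with results_lock:
            results.append(body)
        urls.task_done()

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for u in ["http://example.com/1", "http://example.com/2", "http://example.com/3"]:
    urls.put(u)
for _ in threads:
    urls.put(None)            # one pill per worker
for t in threads:
    t.join()
print(sorted(results))
```

Pooling keeps many pages in flight while the shared rate limit keeps the crawler polite to the target host.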

FlexCP

  •    PHP

FlexCP is a web hosting control panel. It allows complete automation of all hosting-related tasks via a highly skinnable front end that can be accessed from any web browser. It is developed in portable PHP and is in large part a rewrite of web://cp.

htmlparser

  •    

Products of the project: Java HTMLParser - VietSpider Web Data Extractor - Extractor VietSpider News. Click on "Show project details" to see more features of each product.

RSS EXTRACTOR

  •    Java

RSS EXTRACTOR is a Java library for generating RSS newsfeeds from the RSS web feeds of multiple websites. It extracts the best newsfeed entries and produces an RSS file that is a fusion of the newsfeed entries from several websites.
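
Feed fusion of this kind can be sketched as follows (a Python illustration, not the library's Java API), assuming entries are deduplicated by link and sorted newest-first:

```python
from datetime import datetime

# Two feeds that share one story (same link, different titles).
feed_a = [
    {"title": "Story 1", "link": "http://a/1", "date": datetime(2020, 1, 2)},
    {"title": "Story 2", "link": "http://a/2", "date": datetime(2020, 1, 1)},
]
feed_b = [
    {"title": "Story 1 (copy)", "link": "http://a/1", "date": datetime(2020, 1, 2)},
    {"title": "Story 3", "link": "http://b/3", "date": datetime(2020, 1, 3)},
]

def fuse(*feeds):
    seen, merged = set(), []
    for entry in (e for feed in feeds for e in feed):
        if entry["link"] not in seen:   # dedupe on the link
            seen.add(entry["link"])
            merged.append(entry)
    # Newest entries first in the fused feed.
    return sorted(merged, key=lambda e: e["date"], reverse=True)

fused = fuse(feed_a, feed_b)
print([e["title"] for e in fused])  # ['Story 3', 'Story 1', 'Story 2']
```

A real implementation would also parse the source feeds and serialize the result back to RSS XML.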

ferret - Declarative web scraping

  •    Go

ferret is a web scraping system that aims to simplify data extraction from the web for tasks such as UI testing, machine learning, and analytics. Having its own declarative language, ferret abstracts away the technical details and complexity of the underlying technologies, helping you focus on the data itself. It's extremely portable, extensible, and fast. An example from its documentation demonstrates the use of dynamic pages: first, it loads the main Google Search page, types the search criteria into an input box, and clicks the search button. The click action triggers a redirect, so it waits until the redirect ends. Once the page has loaded, it iterates over all elements in the search results and assigns the output to a variable. A final for loop filters out empty elements that may result from inaccurate use of selectors.

python-codeplex-scraper

  •    

This is a simple, lightweight (and probably fragile) web scraper for CodePlex. It allows you to retrieve public information for users and projects.

scraper - Simple web scraping for Google Chrome.

  •    Javascript

Simple web scraping for Google Chrome.

ineed - Web scraping and HTML-reprocessing. The easy way.

  •    Javascript

Web scraping and HTML-reprocessing, the easy way. ineed doesn't build and traverse a DOM tree; it operates on a sequence of HTML tokens instead. The whole processing is done in one pass, so it's blazing fast! The token stream is produced by parse5, which parses HTML exactly the same way modern browsers do.
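
The single-pass, token-based approach can be illustrated with Python's stdlib `HTMLParser` standing in for parse5: collect every link in one pass, without ever building a DOM tree.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values as start-tag tokens stream past."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per start-tag token; no tree is ever built.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="/x">x</a> and <a href="/y">y</a></p>')
print(collector.links)  # ['/x', '/y']
```

Because the input is consumed as a token stream, memory use stays flat no matter how large the document is.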

noodle - A node server and module which allows for cross-domain page scraping on web documents with JSONP or POST

  •    Javascript

The noodle tests create a temporary server on port 8889, which the automated tests tell noodle to query against. Contributors and suggestions are welcome.

ssu - Server-Side Uploader, the data aggregation engine.

  •    Javascript

SSU is a scripted web site navigator and scraper. It was originally designed and conceived as part of Wesabe's infrastructure and has since been open-sourced. Its original design goal was to extract OFX data, given bank usernames and passwords, for use on wesabe.com. To get this data it uses XULRunner, a Mozilla project that provides a customizable (and scriptable) browser. SSU has a script for each financial institution it supports that describes how to log in and download data from that institution's web site.

node-web-scraper - Code for the tutorial: Scraping the Web With Node.js by @kukicado

  •    Javascript

Running the tutorial code starts up a Node server; navigate to http://localhost:8081/scrape to see the results.

Goutte - Goutte, a simple PHP Web Scraper

  •    PHP

Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.