10 sites to get the large data set or data corpus for free

  •        0
  

You may require GBs of data to do performance or load testing. How your app behaves when there is loads of data. You need to know the capacity of your application. This is the frequently asked question from the sales team "The customer is having 100GB of data and he wants to know whether our product will handle this? If so how much RAM / Disk storage required?". This article has pointers to the large data corpus.

How to generate that data? The easiest way would be to have some samples of data, multiply it using some scripts. Another option would be to create data using random values. The main disadvantage of this approach is the data will have very less unique content and it may not give desired results. Below are some links to get large data set.

Wikipedia:Database, Wikipedia offers free copies of all available content to interested users. data is available in multiple languages. Content along with images could be downloaded. http://en.wikipedia.org/wiki/Wikipedia:Database_download

Common crawl builds and maintains an open crawl of the web accessible to everyone. The data is stored in amazon s3bucket and the requester may have spend some money to access it. https://www.commoncrawl.org/

EDRM File Formats Data Set, consists of 381 files covering 200 file formats. http://www.edrm.net/resources/data-sets/edrm-file-format-data-set

Apache Mahout TLP project to create scalable, machine learning algorithms. Mahout has many links to get free and paid corpus data. https://cwiki.apache.org/confluence/display/MAHOUT/Collections

EDRM Enron Email Data Set v2 consist of Enron e-mail messages and attachments in two sets of downloadable compressed files: XML and PST. http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2

ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference. http://lemurproject.org/clueweb09/

DMOZ - Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It has collections of URLs in different category. Dmoz is one main source for internet search engines. http://www.dmoz.org/rdf.html

theinfo.org - This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects. http://theinfo.org/

Project Gutenberg offers over 36,000 free ebooks to download to your PC, Kindle, Android, iOS or other portable device. http://www.gutenberg.org/

Million song data set, has data related to tracks and artist. http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets

Mailing list archive, Subscribe to any mailing list, you will get dozens of emails. Many groups or communities provide option to download the mail archive.

Sometimes we may not get the type or format of data we want. In this situitation, we could get these data, write some script to convert it to our desired format.

This large data set helps to do load test your app and understand its capacity and bottleneck. Using these data set you cannot validate the test results. If you build a search engine, you cannot verify that these many number of hits should be returned for a given keyword.

Now every company is moving towards cloud. People talk about big data but there is some way to generate these data, so that the application could be well tested.


   




Related Articles

8 Best Open Source Searchengines built on top of Lucene

  • lucene solr searchengine elasticsearch

Lucene is most powerful and widely used Search engine. Here is the list of 7 search engines which is built on top of Lucene. You could imagine how powerful they are.

Read More


An introduction to LucidWorks Enterprise Search

  • lucene solr search engine enterprise

Lucidworks Enterprise search solution is built on top of Apache Solr. It scales seamlessly w/sub-second response times under extreme query loads for multi-billion document collections. It has user friendly UI, which does all the job of configuration and search.

Read More


Solr vs Elastic Search

  • full-text-search search-engine lucene solr elastic-search

Solr and Elastic Search are built on top of Lucene. Both are open source and both have extra features which makes programmer life easy. This article explains the difference and the best situation to choose between them.

Read More


LucidWorks Vs SearchBlox - Enterprise Search Solution

  • lucene solr searchblox lucidworks enterprise-search

Enterprise search software should be capable to search the data available in the entire organization or personnel desktop. The data could be in File system, Web or in Database. It should search contents of Emails, file formats like doc, xls, ppt, pdf and lot more. There are many commercial products available but LucidWorks and SearchBlox are best and free.

Read More


LogicalDOC - Open Source DMS

  • dms document-management-system

LogicalDOC is both a document management and a collaboration system. The software is loaded with many functions and allows organizing, indexing, retrieving, controlling and distributing important business documents securely and safely for any organization and individual.

Read More


An introduction to MongoDB

  • mongodb database document-oriented-databse no-sql c++ data-mining

MongoDB is the most exciting SQL-free database currently available in the market. The new kid on the block, called MongoDB is a scalable, high-performance, open source, schema free and document oriented database that focuses on the ideas of NoSQL Approach. Written in C++, it has taken rapid strides since its emergence into the public sphere as a popular way to build your database applications.

Read More


SeoToaster: easy, fast and efficient open source CMS for top SEO performance

  • cms content-management-system seo ecommerce

SeoToaster is a free Open Source CMS & Ecommerce solution to build, manage and market websites optimized for for top search engine performance. As the name implies, Seo Toaster is to date the only content management system (CMS) to truly integrate SEO execution and web marketing automation technology in full compliance with the search engines industry’s best practices.

Read More


Lucene / Solr as NoSQL database

  • lucene solr no-sql nosql document-store

Lucene and Solr are most popular and widely used search engine. It indexes the content and delivers the search result faster. It has all capabilities of NoSQL database. This article describes about its pros and cons.

Read More


Free Open Source Code Search Engines

  • code-search code-search-engine search-engine

There are couple of sites which indexes the open source code and provides support to search code. Recently Google announced that they removed code search support from Google code. This article provides pointer for code search engine sites.

Read More


Why Elastic Search is gaining more popularity than Solr?

  • solr elastic-search search-engine

Solr and Elastic Search both are built on top of Lucene library. Both are compratively equal. Any new feature / enhancement which get introduced in Lucene, will also get added to Solr. But still Elastic Search which uses Lucene as it core gained more popularity than Solr in recent years.

Read More


Best situation to use Column database

  • column database reporting

Column oriented database or datastore as the name sounds it stores the data by column rather than by row. It has some advantages and disadvantages over traditional RDBMS. Developer should know the typical situation to choose column oriented database.

Read More


Column database vs OLAP

  • business-intelligence olap column-database

OLAP (Online Analytical Processing), Reporting, Data mining related tasks are usually done by Business intelligence products. They do powerful Extraction, Transformation and Loading (ETL) the data and provides various reports. They use relational database as its back end. How could they generate better reports? Will column DB do a better job?

Read More


Restrict Solr Admin Access

  • solr searchengine tips

Solr is a search engine built on top of Lucene. It supports REST interface and has lot of built-in capabilities. Solr package has Admin UI interface which has support to perform query and even delete the contents of the index. If you are using Solr in production then you may need to restrict access. I saw couple of questions in the group related to this topic. Thought to write an article explaining few tips to restrict the user access to Solr admin UI.

Read More


10 Free services for your Website / Blog. Just plug it.

  • free website blog free-service free-resources

Each website / blog delivers useful content or service to its users. But website themselves requires some service to monitor and increase its presence. Here are few free services which could be used by Website / Blog. This will be very much helpful for small business owners.

Read More


Advantages and Disadvantages of using Hibernate like ORM libraries

  • database orm

Traditionally Programmers used ODBC, JDBC, ADO etc to access database. Developers need to write SQL queries, process the result set and convert the data in the form of objects (Data model). I think most programmers would typically write a function to convert the object to query and result set to object. To overcome these difficulties, ORM provides a mechanism to directly use objects and interact with the database.

Read More


Why require Searchengine? Why not use database for full text search in Enterprise application.

  • searchengine database

Most of the database has support of full text search, basically indexing and saarching. MySQL, Oracle and many more databases has in-built full text search. Then what is the need to go for external search engine like Lucene, Sphinx, Solr etc. Check out the advantage of using Searchengine.

Read More


Whats new in Lucene / Solr 4.0

  • lucene solr new-release

The release 4.0 is one of the important milestone for Lucene and Solr. It has lot of new features and performance important. Few important ones are highliggted in this article.

Read More


How to create SEO friendly url

  • seo url searchengine

SEO friendly URL is recommended for any website which wants to be indexed and wants its presence in search results. Searchengine mostly index the static URL. It will avoid the URL which has lot of query strings. Almost all websites generate content dynamically then how could the URL be static. That is the job of the programmer.

Read More


Microweber CMS - An open source CMS with Ecommerce support

  • cms e-commerce microweber

To the user's satisfaction, there is a whole wide world of different CMS, all suitable for different needs. You can go for the giants like Wordpress or Joomla or pick one of the rising forces - Shopify, Squarespace or others. Microweber CMS fills a hole in the current technological ecosystem, aimed at delivering a light software that is perfect for all end-users lacking the technical knowledge needed for complicated website building.

Read More


Git vs Subversion

  • version-control subversion git

Git and Subversion are most popular and widely used version control system. What is the best situation to choose them? It is important to know its pros and cons, evaluate your requirement and choose the right one.

Read More