10 sites to get the large data set or data corpus for free

  •        0
  

We aggregate and tag open source projects. We have collections of more than one million projects. Check out the projects section.



You may require GBs of data to do performance or load testing. How your app behaves when there is loads of data. You need to know the capacity of your application. This is the frequently asked question from the sales team "The customer is having 100GB of data and he wants to know whether our product will handle this? If so how much RAM / Disk storage required?". This article has pointers to the large data corpus.

How to generate that data? The easiest way would be to have some samples of data, multiply it using some scripts. Another option would be to create data using random values. The main disadvantage of this approach is the data will have very less unique content and it may not give desired results. Below are some links to get large data set.

Wikipedia:Database, Wikipedia offers free copies of all available content to interested users. data is available in multiple languages. Content along with images could be downloaded. http://en.wikipedia.org/wiki/Wikipedia:Database_download

Common crawl builds and maintains an open crawl of the web accessible to everyone. The data is stored in amazon s3bucket and the requester may have spend some money to access it. https://www.commoncrawl.org/

EDRM File Formats Data Set, consists of 381 files covering 200 file formats. http://www.edrm.net/resources/data-sets/edrm-file-format-data-set

Apache Mahout TLP project to create scalable, machine learning algorithms. Mahout has many links to get free and paid corpus data. https://cwiki.apache.org/confluence/display/MAHOUT/Collections

EDRM Enron Email Data Set v2 consist of Enron e-mail messages and attachments in two sets of downloadable compressed files: XML and PST. http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2

ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009. The dataset is used by several tracks of the TREC conference. http://lemurproject.org/clueweb09/

DMOZ - Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It has collections of URLs in different category. Dmoz is one main source for internet search engines. http://www.dmoz.org/rdf.html

theinfo.org - This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange tips and tricks, develop and share tools together, and begin to integrate their particular projects. http://theinfo.org/

Project Gutenberg offers over 36,000 free ebooks to download to your PC, Kindle, Android, iOS or other portable device. http://www.gutenberg.org/

Million song data set, has data related to tracks and artist. http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets

Mailing list archive, Subscribe to any mailing list, you will get dozens of emails. Many groups or communities provide option to download the mail archive.

Sometimes we may not get the type or format of data we want. In this situitation, we could get these data, write some script to convert it to our desired format.

This large data set helps to do load test your app and understand its capacity and bottleneck. Using these data set you cannot validate the test results. If you build a search engine, you cannot verify that these many number of hits should be returned for a given keyword.

Now every company is moving towards cloud. People talk about big data but there is some way to generate these data, so that the application could be well tested.


Sponsored:
To find embedded technology information about MCU, IoT, AI etc Check out embedkari.com.


   

We publish blog post about open source products. If you are interested in sharing knowledge about open source products, please visit write for us

Subscribe to our newsletter.

We will send mail once in a week about latest updates on open source tools and technologies. subscribe our newsletter



Related Articles

8 Best Open Source Searchengines built on top of Lucene

  • lucene solr searchengine elasticsearch

Lucene is most powerful and widely used Search engine. Here is the list of 7 search engines which is built on top of Lucene. You could imagine how powerful they are.

Read More


Introduction to Apache Cassandra

  • cassandra database nosql

Apache Cassandra was designed by Facebook and was open-sourced in July 2008. It is regarded as perfect choice when the users demand scalability and high availability without any impact towards performance. Apache Cassandra is highly scalable, high-performance distributed database designed to handle large voluminous amounts of data across many commodity servers with no failure.

Read More


Exonum Blockchain Framework by the Bitfury Group

  • blockchain bitcoin hyperledger blockchain-framework

Exonum is an extensible open source blockchain framework for building private blockchains which offers outstanding performance, data security, as well as fault tolerance. The framework does not include any business logic, instead, you can develop and add the services that meet your specific needs. Exonum can be used to build various solutions from a document registry to a DevOps facilitation system.

Read More


AbanteCart - Easy to use open source e-commerce platform, helps selling online

  • e-commerce ecommerce cart

AbanteCart is a free, open source shopping cart that was built by developers with a passion for free and accessible software. Founded in 2010 (launched in 2011), the platform is coded in PHP and supports MySQL. AbanteCart’s easy to use admin and basic layout management tool make this open source solution both easy to use and customizable, depending on the skills of the user. AbanteCart is very user-friendly, it is entirely possible for a user with little to no coding experience to set up and use this cart. If the user would be limited to the themes and features available in base AbanteCart, there is a marketplace where third-party extensions or plugins come to the rescue.

Read More


PySpark: Installation, RDDs, MLib, Broadcast and Accumulator

  • pyspark spark python rdd big-data

We knew that Apace Spark- the most famous parallel computing model or processing the massive data set is written in Scala programming language. The Apace foundation offered a tool to support the Python in Spark which was named PySpark. The PySpark allows us to use RDDs in Python programming language through a library called Py4j. This article provides basic introduction about PySpark, RDD, MLib, Broadcase and Accumulator.

Read More



An introduction to LucidWorks Enterprise Search

  • lucene solr search engine enterprise

Lucidworks Enterprise search solution is built on top of Apache Solr. It scales seamlessly w/sub-second response times under extreme query loads for multi-billion document collections. It has user friendly UI, which does all the job of configuration and search.

Read More


Best open source Text Editors

  • text-editor editor tools dev-tools

Text editors are mainly used by programmers and developers for manipulating plain text source code, editing configuration files or preparing documentation and even viewing error logs. Text editors is a piece of software which enables to create, modify and delete files that a programmer is using while creating website or mobile app.In this article, we will discuss about top 7 all-round performing text editors which is highly supportive for programmers.

Read More


Holistic usage guide for OpenSSL

  • openssl security certificate tools

OpenSSL is a general purpose cryptographty toolkit that provides an open source implementation of Transport Layer Security(TLS) and Secure Socket Layer(SSL) protocols. It is written in C,assembly and Perl language but wrappers are available in all languages. This article explains about OpenSSL commands.

Read More


Enhancing The Experience With Android TV: Briefly About One Developer's Impressions

  • android android-tv

In my software development career, I have always been attracted to new technologies and innovative solutions. Project by project, I have been exploring something new, discovering and implementing different approaches and trying new solutions. When Android TV showed up, I set a new personal goal. I described my impressions and the overall opinion on the application development for Android TV right here. Hope you will find it useful and interesting.

Read More


Connect to MongoDB and Perform CRUD using Java

  • java mongodb database programming

MongoDB is a popular and widely used open source NoSQL database. MongoDB is a distributed database at its core, so high availability, horizontal scaling, and geographic distribution is quite possible. It is licensed under Server Side Public License. Recently they moved to Server Side Public License, before that MongoDB was released under AGPL. This article will provide basic example to connect and work with MongoDB using Java.

Read More


Blockchain Technology Will Revolutionize These Five Industries

  • blockchain security industries

Blockchain is buzzing across the world with its endless potential for disrupting a broad spectrum of the industries beyond the storage and transferring the values. Cryptocurrencies are revolutionizing variant industries and plenty of people talk about it with and their potential. Blockchain technology may have been made for bitcoins but it has immense potential for making the transactions secure and exciting. The technology is not ubiquitous but it already seems to be disrupting a huge number of the highly centralized industries that demonstrate practical use cases for changing the way businesses are done.

Read More


Solr vs Elastic Search

  • full-text-search search-engine lucene solr elastic-search

Solr and Elastic Search are built on top of Lucene. Both are open source and both have extra features which makes programmer life easy. This article explains the difference and the best situation to choose between them.

Read More


LucidWorks Vs SearchBlox - Enterprise Search Solution

  • lucene solr searchblox lucidworks enterprise-search

Enterprise search software should be capable to search the data available in the entire organization or personnel desktop. The data could be in File system, Web or in Database. It should search contents of Emails, file formats like doc, xls, ppt, pdf and lot more. There are many commercial products available but LucidWorks and SearchBlox are best and free.

Read More


LogicalDOC - Open Source DMS

  • dms document-management-system

LogicalDOC is both a document management and a collaboration system. The software is loaded with many functions and allows organizing, indexing, retrieving, controlling and distributing important business documents securely and safely for any organization and individual.

Read More


Quick Start Programming Guide for redis using java client Jedis

  • redis jedis redis-client programming database java

Redis is an open source (BSD licensed), in-memory data structure store, used also as a database cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. This article explains about how to communicate with Redis using Java client Jedis.

Read More


SeoToaster: easy, fast and efficient open source CMS for top SEO performance

  • cms content-management-system seo ecommerce

SeoToaster is a free Open Source CMS & Ecommerce solution to build, manage and market websites optimized for for top search engine performance. As the name implies, Seo Toaster is to date the only content management system (CMS) to truly integrate SEO execution and web marketing automation technology in full compliance with the search engines industry’s best practices.

Read More


Getting Started on Angular 7

  • angular ui-ux front-end-framework

Angular is a platform for building responsive web, native desktop and native mobile applications. Angular client applications are built using HTML, CSS and Typescript. Typescript is a typed superset of Javascript that compiles to plain Javascript. Angular core and optional modules are built using Typescript. Code has been licensed as MIT License.

Read More


An introduction to MongoDB

  • mongodb database document-oriented-databse no-sql c++ data-mining

MongoDB is the most exciting SQL-free database currently available in the market. The new kid on the block, called MongoDB is a scalable, high-performance, open source, schema free and document oriented database that focuses on the ideas of NoSQL Approach. Written in C++, it has taken rapid strides since its emergence into the public sphere as a popular way to build your database applications.

Read More


Advanced Programming Guide in Redis using Jedis

  • redis jedis advanced-guide cluster pipline publish-subscribe

Redis is an in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. This blog covers the advanced concepts like cluster, publish and subscribe, pipeling concepts of Redis using Jedis Java library.

Read More


Push Notifications using Angular

  • angular push-notifications notifications

Notifications is a message pushed to user's device passively. Browser supports notifications and push API that allows to send message asynchronously to the user. Messages are sent with the help of service workers, it runs as background tasks to receive and relay the messages to the desktop if the application is not opened. It uses web push protocol to register the server and send message to the application. Once user opt-in for the updates, it is effective way of re-engaging users with customized content.

Read More