
synth - The Declarative Data Generator

  •    Rust

Synth is a tool for generating realistic data using a declarative data model. It is database agnostic and can scale to millions of rows of data. The data model is flexible, and you can version control it in git, peer review it, and automate it.

Synthea - Synthetic Patient Population Simulator

  •    Java

Synthea is a Synthetic Patient Population Simulator. The goal is to output synthetic, realistic (but not real), patient data and associated health records in a variety of formats.

cvdRiskData - R package for Cardiovascular Risk Dataset and Data generation script

  •    R

Synthetic Cardiovascular Risk Dataset and data generation script, available as an R package. CSV versions of the dataset are available in the data-raw/ folder of this repo.

pydbgen - Random dataframe and database table generator

  •    Python

While it is easy to generate random numbers or simple words for learning Pandas or dataframe operations, it is often non-trivial to generate full data tables with meaningful yet random entries for the fields most commonly encountered in databases, such as name, age, birthday, credit card number, SSN, email id, physical address, company name, job title, etc. This Python package generates a random database TABLE (or a Pandas dataframe, or an Excel file) based on the user's choice of data types (database fields). The user can specify the number of samples needed and can also designate a "PRIMARY KEY" for the database table. Finally, the TABLE is inserted into a new or existing database file of the user's choice.
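The workflow described above (random but plausible fields, a primary key, insertion into a database file) can be sketched with the Python standard library alone. This is a minimal illustration and not pydbgen's API: the `people` table, the name lists, and the `random_row` and `build_table` helpers are all hypothetical.

```python
import random
import sqlite3
import string

# Hypothetical sample values; a real generator would draw from larger pools.
FIRST_NAMES = ["Ada", "Grace", "Alan", "Linus", "Barbara"]
LAST_NAMES = ["Lovelace", "Hopper", "Turing", "Torvalds", "Liskov"]


def random_row(rng, row_id):
    """Build one row of plausible-looking but random field values."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    suffix = "".join(rng.choice(string.digits) for _ in range(3))
    email = f"{first}.{last}{suffix}@example.com".lower()
    return (row_id, f"{first} {last}", rng.randint(18, 90), email)


def build_table(conn, num_rows, seed=0):
    """Create a table with a primary key and fill it with random rows."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people ("
        "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, email TEXT)"
    )
    rows = [random_row(rng, i) for i in range(1, num_rows + 1)]
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return rows


conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the table
rows = build_table(conn, num_rows=10)
count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(count, "rows inserted")
```

Seeding the generator makes the table reproducible, which helps when the random data is used in tests or tutorials.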

synthia - 📈 🐍 Multidimensional synthetic data generation in Python

  •    Python

Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences (Meyer et al. 2021). Copula and functional Principal Component Analysis (fPCA) are statistical models that allow these properties to be simulated (Joe 2014). As such, copula generated data have shown potential to improve the generalization of machine learning (ML) emulators (Meyer et al. 2021) or anonymize real-data datasets (Patki et al. 2016). Synthia is an open source Python package to model univariate and multivariate data, parameterize data using empirical and parametric methods, and manipulate marginal distributions. It is designed to enable scientists and practitioners to handle labelled multivariate data typical of computational sciences. For example, given some vertical profiles of atmospheric temperature, we can use Synthia to generate new but statistically similar profiles in just three lines of code (Table 1).
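To make the copula idea concrete, here is a minimal standard-library sketch of a two-variable Gaussian copula with empirical marginals. It does not use Synthia's API; the function names and the toy data are assumptions made for illustration only.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Map values to standard-normal scores through their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF with linear interpolation between order statistics."""
    pos = u * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def sample_gaussian_copula(xs, ys, n_samples, seed=0):
    """Draw synthetic (x, y) pairs that keep the marginals and rank dependence."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(xs), normal_scores(ys))
    sx, sy = sorted(xs), sorted(ys)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho * rho)) * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(sx, u1), empirical_quantile(sy, u2)))
    return samples


# Toy "real" data: two positively correlated variables.
data_rng = random.Random(42)
real_x = [data_rng.gauss(10, 2) for _ in range(500)]
real_y = [0.8 * x + data_rng.gauss(0, 1) for x in real_x]
synthetic = sample_gaussian_copula(real_x, real_y, 500)
```

The synthetic pairs stay within the range of the observed data because the marginals are sampled from the empirical quantiles; a parametric marginal would be needed to extrapolate beyond it.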

BMW-Labeltool-Lite - This repository provides you with an easy-to-use labeling tool for State-of-the-art Deep Learning training purposes

  •    CSharp

A pre-trained or custom-trained model can be connected to the LabelTool lite. The connected model can then suggest appropriate labels for each image, which accelerates the labeling process. A sample dataset is provided in case you don't have your own custom dataset.

datagene - DataGene - Identify How Similar TS Datasets Are to One Another (by @firmai)

  •    Jupyter

DataGene is developed to detect and compare dataset similarity between real and synthetic datasets as well as train, test, and validation datasets. You can read the report on SSRN for additional details. Datasets can largely be compared using quantitative and visual methods. Generated data can take on many formats and can consist of multiple dimensions of various widths and heights. Original and generated datasets have to be transformed into an acceptable format before they can be compared; these transformations sometimes lead to a reduction in array dimensions. There are two reasons why we might want to reduce array dimensions: the first is to establish an acceptable format for distance calculations; the second is the preference for comparing like with like. You can use the MTSS-GAN to generate diverse multivariate time series data using stacked generative adversarial networks in combination with embedding and recurrent neural network models.
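As a rough illustration of reducing array dimensions before a distance calculation, the sketch below collapses a (samples, timesteps, features) dataset to (samples, features) by averaging over time, then compares the per-feature means of two datasets. It is not DataGene's API; the function names and the choice of a Euclidean distance on means are assumptions.

```python
import math


def reduce_to_2d(dataset):
    """Collapse (samples, timesteps, features) to (samples, features) by averaging over time."""
    return [[sum(col) / len(col) for col in zip(*sample)] for sample in dataset]


def feature_means(matrix):
    """Average each feature column of a (samples, features) matrix."""
    return [sum(col) / len(col) for col in zip(*matrix)]


def mean_distance(original, generated):
    """Euclidean distance between the per-feature means of two datasets."""
    a = feature_means(reduce_to_2d(original))
    b = feature_means(reduce_to_2d(generated))
    return math.dist(a, b)


# Two tiny datasets: 2 samples, 2 timesteps, 2 features each.
original = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
generated = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 12.0]]]

print(mean_distance(original, original))   # identical datasets
print(mean_distance(original, generated))  # one feature value differs
```

Averaging over time discards temporal structure, so a distance of zero here only shows that the feature means agree, not that the two time series behave alike.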

mtss-gan - MTSS-GAN: Multivariate Time Series Simulation with Generative Adversarial Networks (by @firmai)


Please experiment with the code in the Colab notebook below and give me your feedback in the issues tab. I will read it to improve a future version of this model. The model was developed in a Colaboratory notebook. I have added a few code snippets here; if there is demand, I can build a package, so please let me know in the issues tab. For some additional information, feel free to consult the paper.
