
Hub - Fastest dataset optimization and management for machine and deep learning

  •    Python

Software 2.0 needs Data 2.0, and Hub delivers it. Data scientists and ML researchers spend most of their time on data management and preprocessing rather than on training models. Hub aims to fix this. It stores your datasets, even petabyte-scale ones, as a single numpy-like array in the cloud, so you can seamlessly access and work with them from any machine. Hub makes any data type (images, text files, audio, or video) stored in the cloud usable as fast as if it were stored on premises. With the same dataset view, your team can always stay in sync.

nfstream - NFStream: a Flexible Network Data Analysis Framework.

  •    Python

NFStream is a Python framework providing fast, flexible, and expressive data structures designed to make working with online or offline network data both easy and intuitive. It aims to be the fundamental high-level building block for practical, real-world network data analysis in Python. It also has the broader goal of becoming a common network data analytics framework for researchers, providing data reproducibility across experiments. Binary installers for the latest released version are available on PyPI.

NFStream - A Flexible Network Data Analysis Framework

  •    Python

NFStream is a Python package providing fast, flexible, and expressive data structures designed to make working with online or offline network data both easy and intuitive. It aims to be the fundamental high-level building block for practical, real-world network data analysis in Python. It also has the broader goal of becoming a common network data processing framework for researchers, providing data reproducibility across experiments. NFStream extracts more than 90 flow features and can convert them directly to a pandas DataFrame or a CSV file.

MNIST-Sequence - A tool to generate image datasets for sequences of handwritten digits using the MNIST database

  •    Python

The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. It is also widely used for training and testing in the field of machine learning. The goal of this project is to use the MNIST handwritten digit images to generate images representing sequences of handwritten digits. The project also provides a utility to generate and save labeled training/test image datasets of MNIST sequences.
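
As a rough illustration of the idea, and not the project's actual code, a sequence image can be built by placing digit images side by side. The sketch below assumes each digit is a 28x28 NumPy array; the function name make_sequence_image, the spacing parameter, and the random placeholder arrays standing in for real MNIST digits are all hypothetical.

```python
import numpy as np


def make_sequence_image(digit_images, spacing=0):
    """Concatenate equally sized digit images left to right.

    `spacing` is the number of blank (zero) pixel columns between digits.
    Hypothetical helper for illustration; not part of MNIST-Sequence.
    """
    height = digit_images[0].shape[0]
    gap = np.zeros((height, spacing), dtype=digit_images[0].dtype)
    parts = []
    for i, img in enumerate(digit_images):
        if i > 0 and spacing > 0:
            parts.append(gap)
        parts.append(img)
    return np.hstack(parts)


# Placeholder 28x28 arrays stand in for real MNIST digit images.
rng = np.random.default_rng(0)
digits = [rng.integers(0, 256, size=(28, 28), dtype=np.uint8) for _ in range(3)]

sequence = make_sequence_image(digits, spacing=4)
print(sequence.shape)  # (28, 92): three 28-pixel digits plus two 4-pixel gaps
```

With real data, the placeholder arrays would be replaced by digit images drawn from the MNIST database, and the digit labels would be stored alongside each generated image.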

Ransomware-Json-Dataset - Compiles a JSON dataset from public sources that contains properties to aid in the detection and mitigation of over 400 variants of ransomware

  •    Python

Compiles a JSON dataset from public sources, containing properties to aid in the detection and mitigation of over 400 variants of ransomware. The script downloads the latest version of the Ransomware Summary spreadsheet and processes it into a local JSON output, which is placed in the core folder of your local repository along with a copy of the latest version of the spreadsheet. To change the source and destinations for local files, edit the constants in the header of the 'update_json.py' file.

tpch-tools - Tools for working with the TPC-H benchmark and MonetDB

  •    Shell

Currently, only MonetDB is supported as the DBMS into which data is loaded, but this may expand in the future. Feel free to open an issue or write to me.

datagene - DataGene - Identify How Similar TS Datasets Are to One Another (by @firmai)

  •    Jupyter

DataGene is developed to detect and compare dataset similarity between real and synthetic datasets as well as between train, test, and validation datasets. You can read the report on SSRN for additional details. Datasets can largely be compared using quantitative and visual methods. Generated data can take on many formats and can consist of multiple dimensions of various widths and heights. Original and generated datasets have to be transformed into an acceptable format before they can be compared, and these transformations sometimes lead to a reduction in array dimensions. There are two reasons to reduce array dimensions: the first is to establish an acceptable format for distance calculations; the second is the preference for comparing like with like. You can use MTSS-GAN to generate diverse multivariate time series data using stacked generative adversarial networks in combination with embedding and recurrent neural network models.
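
To make the dimension-reduction step concrete, here is a minimal sketch that does not use DataGene's own API. It assumes both datasets are NumPy arrays shaped (samples, timesteps, features); the function names reduce_to_2d and mean_feature_distance, the choice of averaging over time, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def reduce_to_2d(data):
    """Collapse (samples, timesteps, features) to (samples, features)
    by averaging over the time axis."""
    return data.mean(axis=1)


def mean_feature_distance(original, generated):
    """Euclidean distance between the per-feature means of two datasets
    after both have been reduced to the same 2D format."""
    a = reduce_to_2d(original).mean(axis=0)
    b = reduce_to_2d(generated).mean(axis=0)
    return float(np.linalg.norm(a - b))


# Synthetic stand-ins for an original and a generated dataset.
rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=(100, 20, 5))
generated = original + 0.5  # shifted copy, so the distance is nonzero

print(mean_feature_distance(original, original))   # 0.0
print(mean_feature_distance(original, generated))
```

Reducing both arrays the same way before measuring distance is what keeps the comparison like with like; other reductions or distance measures could be substituted.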
