parso - lightweight Java library designed to read SAS7BDAT datasets

  •    Java

Parso is a lightweight Java library designed to read SAS7BDAT datasets. Its interfaces are analogous to those of other table-reading libraries, such as the CSVReader library. Despite its small size, Parso is the only full-featured open-source solution for processing SAS7BDAT datasets, whether uncompressed, CHAR-compressed, or BIN-compressed. It is effective for processing the clinical and statistical data often stored in SAS7BDAT format, and it can convert data to CSV.

http://lifescience.opensource.epam.com/parso.html
https://github.com/epam/parso

Dependencies:

org.slf4j:slf4j-api:1.7.5
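
Below is a minimal reading sketch in Java based on the reader API shown in the project's documentation; the class and method names should be verified against the parso release in use, and the input path is a placeholder. It streams a SAS7BDAT file row by row and prints naive (unquoted) CSV; the library's own CSV conversion support mentioned above handles quoting properly.

    import com.epam.parso.Column;
    import com.epam.parso.SasFileReader;
    import com.epam.parso.impl.SasFileReaderImpl;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.List;

    public class Sas7bdatToCsv {
        public static void main(String[] args) throws Exception {
            // "dataset.sas7bdat" is a placeholder path
            try (InputStream is = new FileInputStream("dataset.sas7bdat")) {
                SasFileReader reader = new SasFileReaderImpl(is);
                // Print the column names as a CSV header
                List<Column> columns = reader.getColumns();
                StringBuilder header = new StringBuilder();
                for (Column c : columns) {
                    if (header.length() > 0) header.append(',');
                    header.append(c.getName());
                }
                System.out.println(header);
                // Stream rows one at a time; readNext() returns null at end of file
                Object[] row;
                while ((row = reader.readNext()) != null) {
                    StringBuilder sb = new StringBuilder();
                    for (int i = 0; i < row.length; i++) {
                        if (i > 0) sb.append(',');
                        sb.append(row[i] == null ? "" : row[i].toString());
                    }
                    System.out.println(sb);
                }
            }
        }
    }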

Related Projects

haven - Read SPSS, Stata and SAS files from R

  •    C

SAS: read_sas() reads .sas7bdat and .sas7bcat files, read_xpt() reads SAS transport files (versions 5 and 8), and write_sas() writes .sas7bdat files. SPSS: read_sav() reads .sav files, read_por() reads the older .por files, and write_sav() writes .sav files.

nvvl - A library that uses hardware acceleration to load sequences of video frames to facilitate machine learning training

  •    C++

NVVL (NVIDIA Video Loader) is a library that loads random sequences of video frames from compressed video files to facilitate machine learning training. It uses FFmpeg's libraries to parse and read the compressed packets from video files, and the video decoding hardware available on NVIDIA GPUs to off-load and accelerate the decoding of those packets, providing a ready-for-training tensor in GPU device memory. NVVL can additionally perform data augmentation while loading the frames: frames can be scaled, cropped, and flipped horizontally using the GPU's dedicated texture-mapping units. Output can be in RGB or YCbCr color space, normalized to [0, 1] or [0, 255], and in float, half, or uint8 tensors. Using compressed video files instead of individual frame image files significantly reduces the demands on the storage and I/O systems during training. Storing datasets as video files consumes an order of magnitude less disk space, allowing larger datasets to fit both in system RAM and on local SSDs for fast access, and fewer bytes must be read from disk during loading. Fitting on smaller, faster storage and reading fewer bytes at load time alleviates the bottleneck of retrieving data from disks, which will only get worse as GPUs get faster. For the dataset used in our example project, H.264-compressed .mp4 files were nearly 40x smaller than the same frames stored as .png files.

sparse-voxel-octrees - CPU Sparse Voxel Octree Implementation

  •    C++

This project provides a multithreaded CPU sparse voxel octree implementation in C++, capable of raytracing large datasets in real time, converting raw voxel files to octrees, and converting mesh data (in the form of PLY files) to voxel octrees. The conversion routines can handle datasets much larger than working memory, allowing the creation and rendering of very large octrees (resolutions of 8192x8192x8192 and up).

pykitti - Python tools for working with KITTI data.

  •    Python

This package provides a minimal set of tools for working with the KITTI dataset [1] in Python. So far only the raw datasets and odometry benchmark datasets are supported, but we're working on adding support for the others. We welcome contributions from the community. This package assumes that you have also downloaded the calibration data associated with the sequences you want to work on (these are separate files from the sequences themselves), and that the directory structure is unchanged from the original structure laid out in the KITTI zip files.

datasets - 🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

  •    Python

🤗Datasets also provides access to more than 15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. 🤗Datasets originated as a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section "Main differences between 🤗Datasets and tfds".


Hub - Fastest dataset optimization and management for machine and deep learning

  •    Python

Software 2.0 needs Data 2.0, and Hub delivers it. Most of the time, data scientists and ML researchers work on data management and preprocessing instead of training models; with Hub, we are fixing this. We store your (even petabyte-scale) datasets as a single numpy-like array on the cloud, so you can seamlessly access and work with them from any machine. Hub makes any data type (images, text files, audio, or video) stored in the cloud usable as fast as if it were stored on premises. With the same dataset view, your team can always be in sync.

covid-19-open-data - Datasets of daily time-series data related to COVID-19 for over 20,000 distinct locations around the world

  •    Python

The data is drawn from multiple sources, as listed below, and stored in separate tables as CSV files grouped by context; the tables can be easily merged thanks to the use of consistent geographic (and temporal) keys, as is done for the main table. (1) key is a unique string for a specific geographical region, built from a combination of codes such as ISO 3166, NUTS, FIPS, and other local equivalents. (2) Refer to the data sources for specifics about each data source and the associated terms of use. (3) Datasets without a date column contain the most recently reported information for each datapoint to date.
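
To illustrate how the shared key column makes these tables easy to join, here is a hedged Java sketch that merges two of the tables in memory. The file names (index.csv, epidemiology.csv) and column positions are assumptions to check against the repository's schema documentation, and the comma splitting assumes no quoted fields:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MergeByKey {
        public static void main(String[] args) throws IOException {
            // Assumed layout: index.csv holds per-location metadata, "key" first
            Map<String, String> index = new HashMap<>();
            List<String> indexLines = Files.readAllLines(Paths.get("index.csv"));
            for (String line : indexLines.subList(1, indexLines.size())) {
                String key = line.substring(0, line.indexOf(','));
                index.put(key, line);
            }
            // Assumed layout: epidemiology.csv holds "date,key,..." time-series rows
            List<String> epiLines = Files.readAllLines(Paths.get("epidemiology.csv"));
            for (String line : epiLines.subList(1, epiLines.size())) {
                String[] fields = line.split(",", 3);
                // Inner join on the shared geographic key
                if (index.containsKey(fields[1])) {
                    System.out.println(line + "," + index.get(fields[1]));
                }
            }
        }
    }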

awesome-public-datasets - A topic-centric list of high-quality open datasets in public domains

  •    

NOTICE: This repo is automatically generated by apd-core. Please DO NOT modify this file directly. We have provided a new way to contribute to Awesome Public Datasets; the original PR entrance directly on the repo is closed forever. This is a topic-centric list of high-quality public data sources, collected and tidied from blogs, answers, and user responses. Most of the datasets listed below are free; however, some are not. Other amazingly awesome lists can be found in sindresorhus's awesome list.

datasets - TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

  •    Python

TensorFlow Datasets provides many public datasets as tf.data.Datasets. To install and use TFDS, we strongly encourage you to start with our getting started guide. Try it interactively in a Colab notebook.

Grassroots DICOM

  •    Java

Cross-platform DICOM implementation

Analyzing-Visualizing-Data-PowerBI - This repository contains the lab files and other resources for the free Microsoft course DAT207x: Analyzing and Visualizing Data with Power BI

  •    

This repository contains the lab files and other resources for the free Microsoft course DAT207x: Analyzing and Visualizing Data with Power BI. To learn how to connect, explore, and visualize data with Power BI, sign up for this course on edX. Throughout the course you will use examples and datasets provided through text files, Excel workbooks, SQL backup, and Access database. They are provided "as-is." Information and views expressed in the workbooks, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. Some examples are for illustration only and are fictitious. No real association is intended or inferred. Microsoft makes no warranties, express or implied, with respect to the information provided here.

dicom - High Performance DICOM Medical Image Parser in Go

  •    Go

This is a library and command-line tool to read, write, and generally work with DICOM medical image files in native Go. The goal is to build a full-featured, high-performance, and readable DICOM parser for the Go community.

awesome-json-datasets - A curated list of awesome JSON datasets that don't require authentication.

  •    JavaScript

A curated list of awesome JSON datasets that don't require authentication. Pro Tip: Check out Blockchain Data API for more options.

Machine-Learning-with-R-datasets - Formatted datasets for Machine Learning With R by Brett Lantz

  •    

Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.

nlp-datasets - A list of datasets/corpora for NLP tasks, in reverse chronological order.

  •    

This is a list of datasets/corpora for NLP tasks, in reverse chronological order. Suggestions and pull requests are welcome. The goal is to make this a collaborative effort to maintain an updated list of quality datasets.

wiki-reading - This repository contains the three WikiReading datasets as used and described in WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia, Hewlett, et al, ACL 2016 (the English WikiReading dataset) and Byte-level Machine Reading across Morphologically Varied Languages, Kenter et al, AAAI-18 (the Turkish and Russian datasets)

  •    Python

This repository contains the three WikiReading datasets as used and described in WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia, Hewlett et al., ACL 2016 (the English WikiReading dataset) and Byte-level Machine Reading across Morphologically Varied Languages, Kenter et al., AAAI-18 (the Turkish and Russian datasets). Run get_data.sh to download the English WikiReading dataset.

owid-datasets - OWID Dataset Collection

  •    

This is an ongoing collection of datasets with source information in CSV+datapackage format, exported automatically from the ourworldindata.org database. Most datasets included here are annual time series data for social and economic indicators by country. The repository covers mainly smaller datasets which have been individually uploaded and annotated with source information by OWID authors. We also use some larger external data collections on the website like the World Development Indicators, which aren't currently included here.

gpt-3 - GPT-3: Language Models are Few-Shot Learners

  •    

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

csvtotable - Simple command-line utility to convert CSV files to searchable and sortable HTML table.

  •    Python

Simple command-line utility to convert CSV files into searchable and sortable HTML tables. Supports large datasets and horizontal scrolling for a large number of columns. Here is a demo of a sample CSV file converted to an HTML table.





