
pandas-videos - Jupyter notebook and datasets from the pandas Q&A video series

  •    Jupyter

Read about the series, and view all of the videos on one page: Easier data analysis in Python with pandas.

janitor - simple tools for data cleaning in R

  •    R

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.




scrubr - Clean species occurrence records

  •    R

A note about examples: we think that a piping workflow with %>% makes code easier to build up and easier to understand. However, some examples omit the pipe to demonstrate traditional usage.

taxa - taxonomic classes for R

  •    R

taxa defines taxonomic classes and functions to manipulate them. The goal is to use these classes as low-level, fundamental taxonomic classes that other R packages can build on and use. There are a few optional classes used to store information in other classes. In most cases, these can be replaced with simple character values, but using them provides more information and potential functionality.

akvo-lumen - Make sense of your data

  •    Javascript

An open-source, easy-to-use data mashup, analysis and publishing platform.


DTCleaner - data cleaning using multi-target decision trees

  •    Java

It has been recognized that poor data quality can have multiple negative impacts on enterprises [1]. Businesses operating on dirty data are at risk of large financial losses. Maintaining data quality also increases operational costs, as businesses need to spend time and resources detecting and correcting erroneous data. As data grows bigger, data repairing has become an important problem and an important research area. DTCleaner produces multi-target decision trees for the purpose of data cleaning. It is built to detect erroneous tuples in a dataset based on a given set of conditional functional dependencies (CFDs), and to build a classification model that predicts erroneous tuples so that the "cleaned" dataset satisfies the CFDs and is semantically correct.
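To make the CFD idea concrete, here is a minimal Python sketch of the detection step only. It does not reproduce DTCleaner's decision-tree model or its Java API. The function name `cfd_violations` and the fields `country`, `zip` and `city` are hypothetical, chosen for illustration. The CFD checked is "for rows where country is UK, zip determines city".

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Return indices of rows involved in a violation of the CFD
    (condition: lhs -> rhs). Hypothetical helper, not DTCleaner's API."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The dependency only applies to rows matching the condition pattern.
        if all(row.get(k) == v for k, v in condition.items()):
            groups[tuple(row[a] for a in lhs)].append(i)

    violating = []
    for indices in groups.values():
        # Same left-hand side but different right-hand side: a violation.
        if len({rows[i][rhs] for i in indices}) > 1:
            violating.extend(indices)
    return sorted(violating)


# Illustrative data, invented for this sketch.
rows = [
    {"country": "UK", "zip": "EH1", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH1", "city": "Edinburg"},
    {"country": "UK", "zip": "SW1", "city": "London"},
    {"country": "US", "zip": "EH1", "city": "Springfield"},
]
flagged = cfd_violations(rows, {"country": "UK"}, ["zip"], "city")
```

The sketch flags every row in a conflicting group. Deciding which of those rows is actually wrong, and what the repaired value should be, is the part a learned model would handle.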

refinr - Cluster and merge similar char values: an R implementation of Open Refine clustering algorithms

  •    C++

refinr is designed to cluster and merge similar values within a character vector. It features two functions that are implementations of clustering algorithms from the open source software OpenRefine. The cluster methods used are key collision and ngram fingerprint (more info on these here). In addition, there are a few add-on features included to make the clustering/merging functions more useful. These include approximate string matching to allow for merging despite minor misspellings, the option to pass a dictionary vector to dictate edit values, and the option to pass a vector of strings to ignore during the clustering process.
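The key collision idea can be sketched in a few lines of Python. This is an illustration of the general fingerprint technique, not refinr's R implementation. The helper names `fingerprint` and `merge_by_fingerprint` are hypothetical, and the exact normalization steps (lowercasing, stripping accents and punctuation, sorting unique tokens) are assumptions about one reasonable variant.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: lowercase, drop accents and punctuation,
    then sort the unique whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def merge_by_fingerprint(values):
    """Replace each value with the most frequent value that shares its key."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    canonical = {
        key: Counter(members).most_common(1)[0][0]
        for key, members in clusters.items()
    }
    return [canonical[fingerprint(v)] for v in values]


# Illustrative input, invented for this sketch.
names = ["Acme Inc.", "acme inc", "Inc Acme", "Acme Inc.", "Globex"]
merged = merge_by_fingerprint(names)
```

Values whose keys collide end up in the same cluster, and the most common spelling in each cluster is used as the merged value. Misspellings that change the key are not caught by this step and would need approximate string matching.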

pgdedupe - A simple command line interface to the datamade/dedupe library.

  •    Jupyter

A work-in-progress to provide a standard interface for deduplication of large databases with custom pre-processing and post-processing steps. In addition to running a database-level deduplication with dedupe, this script adds custom pre- and post-processing steps to improve the run-time and results, making this a hybrid between fuzzy matching and record linkage.

dataMaid - An R package for data screening

  •    HTML

dataMaid is an R package for documenting and creating reports on data cleanliness. A super simple way to get started is to load the package and use the makeDataReport function on a data frame (if you try to generate several reports for the same data, then it may be necessary to add the replace=TRUE argument to overwrite the existing report).