Data Quality Toolkit: Automatic assessment of data quality and
remediation for machine learning datasets
- URL: http://arxiv.org/abs/2108.05935v1
- Date: Thu, 12 Aug 2021 19:22:27 GMT
- Title: Data Quality Toolkit: Automatic assessment of data quality and
remediation for machine learning datasets
- Authors: Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma
Mittal, Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti, Sameep Mehta,
Sandeep Hans, Pranay Lohia, Aniya Aggarwal, Diptikalyan Saha
- Abstract summary: The Data Quality Toolkit for machine learning is a library of some key quality metrics and relevant remediation techniques.
It can reduce the turn-around times of data preparation pipelines and streamline the data quality assessment process.
- Score: 11.417891017429882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of training data has a huge impact on the efficiency, accuracy
and complexity of machine learning tasks. Various tools and techniques are
available that assess data quality with respect to general cleaning and
profiling checks. However, these techniques cannot detect data
issues specific to machine learning tasks, such as noisy labels or
overlapping classes. We revisit data quality issues in
the context of building a machine learning pipeline and build a tool that can
detect, explain and remediate issues in the data, and systematically and
automatically capture all the changes applied to the data. We introduce the
Data Quality Toolkit for machine learning as a library of some key quality
metrics and relevant remediation techniques to analyze and enhance the
readiness of structured training datasets for machine learning projects. The
toolkit can reduce the turn-around times of data preparation pipelines and
streamline the data quality assessment process. Our toolkit is publicly
available via the IBM API Hub [1] platform; any developer can assess data
quality using IBM's Data Quality for AI APIs [2]. Detailed tutorials are
also available on IBM Learning Path [3].
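The abstract names concrete ML-specific issues such as overlapping classes. The toolkit's own API is not reproduced here, so the sketch below is a hypothetical stand-in showing how such a check can be scored with a k-nearest-neighbour heuristic; all function names and thresholds are illustrative assumptions, not IBM's Data Quality for AI API.

```python
# Hypothetical illustration of an overlapping-classes check; NOT the
# toolkit's actual API. Rows whose nearest neighbours mostly carry a
# different label sit in ambiguous, overlapping regions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def class_overlap_scores(X, y, k=10):
    """Fraction of each row's k nearest neighbours with a different label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)           # idx[:, 0] is the row itself
    neighbour_labels = y[idx[:, 1:]]    # drop the self-match
    return (neighbour_labels != y[:, None]).mean(axis=1)

X, y = make_classification(n_samples=500, n_informative=5, random_state=0)
scores = class_overlap_scores(X, y)
print("rows in heavily overlapping regions:", int((scores > 0.5).sum()))
```

Rows with high scores sit in ambiguous regions and are natural candidates for the kind of relabelling or removal remediations the abstract describes.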
Related papers
- Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond [38.89457061559469]
We propose an innovative methodology that automates dataset creation with negligible cost and high efficiency.
We provide open-source software that incorporates existing methods for label error detection and robust learning under noisy and biased data.
We design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning.
arXiv Detail & Related papers (2024-08-21T04:45:12Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- A Systematic Review of Available Datasets in Additive Manufacturing [56.684125592242445]
In-situ monitoring incorporating visual and other sensor technologies allows the collection of extensive datasets during the Additive Manufacturing process.
Through machine learning, these datasets can be used to determine the quality of the manufactured output and to detect defects.
This systematic review investigates the availability of open image-based datasets originating from AM processes that align with a number of pre-defined selection criteria.
arXiv Detail & Related papers (2024-01-27T16:13:32Z)
- Data Diversity Matters for Robust Instruction Tuning [93.87078483250782]
Recent works have shown that by curating high quality and diverse instruction tuning datasets, we can significantly improve instruction-following capabilities.
We propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT) to control dataset diversity and quality.
We validate the performance of QDIT on several large scale instruction tuning datasets, where we find it can substantially improve worst and average case performance.
arXiv Detail & Related papers (2023-11-21T19:12:18Z)
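The summary above states QDIT's goal but not its algorithm. As a rough, hedged sketch of a quality-diversity trade-off (an assumption for illustration, not the paper's published method), one can greedily select examples by mixing a per-example quality score with each candidate's distance from the already-selected set:

```python
# Illustrative sketch only: greedy selection mixing a per-example
# quality score with distance from the chosen set. NOT QDIT's
# published algorithm; names and weights are assumptions.
import numpy as np

def select_quality_diverse(embeddings, quality, budget, alpha=0.5):
    """Greedily pick `budget` rows; alpha=1 scores quality only, alpha=0 diversity only."""
    n = embeddings.shape[0]
    selected = []
    min_dist = np.full(n, np.inf)  # distance to nearest selected row
    for _ in range(budget):
        if selected:
            diversity = min_dist / (min_dist.max() + 1e-12)  # rescale to [0, 1]
            gain = alpha * quality + (1 - alpha) * diversity
            gain[selected] = -np.inf  # never re-pick a row
        else:
            gain = quality.copy()     # first pick: quality only
        best = int(np.argmax(gain))
        selected.append(best)
        d = np.linalg.norm(embeddings - embeddings[best], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 32))   # stand-in instruction embeddings
qual = rng.uniform(size=1000)       # stand-in quality scores
subset = select_quality_diverse(emb, qual, budget=100)
print(len(subset), "examples selected")
```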
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- CLASSify: A Web-Based Tool for Machine Learning [0.0]
This article presents an automated tool for machine learning classification problems that simplifies training models and producing results while providing informative visualizations and insights into the data.
We present CLASSify, an open-source tool for solving classification problems without the need for knowledge of machine learning.
arXiv Detail & Related papers (2023-10-05T15:51:36Z)
- QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European Commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
- Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z)
- Fix your Models by Fixing your Datasets [0.6058427379240697]
Current machine learning tools lack streamlined processes for improving data quality.
We introduce a systematic framework for finding noisy or mislabelled samples in the dataset.
We demonstrate the efficacy of our framework on public as well as private enterprise datasets of two Fortune 500 companies.
arXiv Detail & Related papers (2021-12-15T02:41:50Z)
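The abstract above does not detail the framework's mechanics. A common baseline for surfacing noisy or mislabelled samples, shown here purely as an illustrative stand-in, is to flag rows whose given label receives little support from out-of-fold predicted probabilities:

```python
# Hedged sketch, not the paper's framework: train with cross-validation
# and flag rows where out-of-fold predictions confidently disagree with
# the given label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=1000, random_state=0)
y_noisy = y.copy()
flip = np.random.default_rng(0).choice(len(y), size=50, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]          # inject synthetic label noise

proba = cross_val_predict(LogisticRegression(max_iter=1000),
                          X, y_noisy, cv=5, method="predict_proba")
# out-of-fold confidence the model assigns to each row's *given* label
given_label_conf = proba[np.arange(len(y_noisy)), y_noisy]
suspects = np.argsort(given_label_conf)[:50]   # least-supported labels
print("flagged rows that were actually flipped:",
      np.intersect1d(suspects, flip).size)
```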
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques must be applied to maintain predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
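As a hedged illustration of the concept-drift "detection or adaptation" the survey above refers to (not any specific tool it benchmarks), the sketch below monitors a sliding-window error rate on a synthetic stream and resets the model when recent errors spike:

```python
# Hedged sketch of sliding-window drift detection. A synthetic stream
# flips its decision boundary halfway through; the monitor resets the
# model when the recent error rate crosses a threshold.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def stream(n):
    """Yield (x, y) pairs; the label rule flips at n // 2 (concept drift)."""
    for t in range(n):
        x = rng.normal(size=(1, 2))
        y = int(x[0, 0] > 0) if t < n // 2 else int(x[0, 0] <= 0)
        yield x, np.array([y])

model = SGDClassifier(loss="log_loss")
window, threshold, errors, fitted = 200, 0.35, [], False
for t, (x, y) in enumerate(stream(2000)):
    if fitted:  # test-then-train: score each sample before updating
        errors.append(int(model.predict(x)[0] != y[0]))
    model.partial_fit(x, y, classes=[0, 1])
    fitted = True
    if len(errors) >= window and np.mean(errors[-window:]) > threshold:
        print(f"drift suspected around step {t}; resetting model")
        model = SGDClassifier(loss="log_loss")  # naive adaptation: retrain
        errors, fitted = [], False
```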
- Data Curation and Quality Assurance for Machine Learning-based Cyber Intrusion Detection [1.0276024900942873]
This article first summarizes existing machine learning-based intrusion detection systems and the datasets used for building these systems.
We then evaluate the data quality of the 11 datasets based on quality dimensions proposed in this paper, to determine the characteristics a HIDS dataset should possess to yield the best possible results. The experimental results show that BERT and GPT were the best algorithms for HIDS on all of the datasets.
arXiv Detail & Related papers (2021-05-20T21:31:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.