Data Quality Evaluation using Probability Models
- URL: http://arxiv.org/abs/2009.06672v1
- Date: Mon, 14 Sep 2020 18:12:19 GMT
- Title: Data Quality Evaluation using Probability Models
- Authors: Allen O'Neill
- Abstract summary: It is shown that, for the data examined, predictions of data quality based on simple good/bad pre-labelled learning examples are accurate.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper discusses an approach that uses machine-learning probability
models to evaluate the difference between good and bad data quality in a
dataset. A decision tree algorithm is used to predict data quality with no
domain knowledge of the datasets under examination. It is shown that, for the
data examined, predicting the quality of data from simple good/bad pre-labelled
learning examples is accurate; however, in general this may not be sufficient
for useful production data quality assessment.
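The approach described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the features, cluster parameters, and tree depth below are assumptions chosen only to show a decision tree separating pre-labelled "good" and "bad" rows.

```python
# Minimal sketch of classifying data quality from good/bad labelled examples.
# The synthetic features are hypothetical: "good" rows cluster tightly,
# "bad" rows are noisier and shifted.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=0.5, size=(200, 4))
bad = rng.normal(loc=2.0, scale=1.5, size=(200, 4))
X = np.vstack([good, bad])
y = np.array([1] * 200 + [0] * 200)  # 1 = good quality, 0 = bad quality

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A shallow tree is enough when the good/bad labels are cleanly separable.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

On data this cleanly separated the tree scores highly, which mirrors the paper's caveat: high accuracy on simple pre-labelled examples does not by itself demonstrate a production-grade quality assessor.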
Related papers
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z)
- Analyzing Dataset Annotation Quality Management in the Wild [63.07224587146207]
Even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts.
While practices and guidelines regarding dataset creation projects exist, large-scale analysis has yet to be performed on how quality management is conducted.
arXiv Detail & Related papers (2023-07-16T21:22:40Z)
- Statistical Dataset Evaluation: Reliability, Difficulty, and Validity [18.36931975072938]
We propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation.
We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity.
arXiv Detail & Related papers (2022-12-19T06:55:42Z)
- Exploring Predictive Uncertainty and Calibration in NLP: A Study on the Impact of Method & Data Scarcity [7.3372471678239215]
We assess the quality of estimates from a wide array of approaches and their dependence on the amount of available data.
We find that while approaches based on pre-trained models and ensembles achieve the best results overall, the quality of uncertainty estimates can surprisingly suffer with more data.
arXiv Detail & Related papers (2022-10-20T15:42:02Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring [8.864453148536061]
The research study provides an automated platform which takes an incoming dataset and metadata to provide the DQ score, report and label.
The results of this study would be useful to data scientists, as the quality label would instill confidence before deploying the data in their respective practical applications.
arXiv Detail & Related papers (2021-08-16T12:20:57Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
- What is the Value of Data? On Mathematical Methods for Data Quality Estimation [35.75162309592681]
We propose a formal definition for the quality of a given dataset.
We assess a dataset's quality by a quantity we call the expected diameter.
arXiv Detail & Related papers (2020-01-09T18:56:48Z)
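The "expected diameter" quality measure in the last entry can be illustrated with a short sketch. This is one plausible reading only, assuming the quantity is the expected distance between two points drawn at random from the dataset; the cited paper's exact definition may differ.

```python
# Hypothetical Monte Carlo estimate of an "expected diameter" style score:
# E[||x - x'||] over random pairs (x, x') sampled from the dataset.
# This is an illustrative assumption, not the cited paper's definition.
import numpy as np

def expected_diameter(X, n_pairs=10_000, seed=0):
    """Estimate the expected Euclidean distance between random pairs of rows."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    return float(np.linalg.norm(X[i] - X[j], axis=1).mean())

# A tightly clustered dataset scores lower than a widely spread one.
tight = np.random.default_rng(1).normal(scale=0.1, size=(500, 3))
spread = np.random.default_rng(2).normal(scale=5.0, size=(500, 3))
print(expected_diameter(tight) < expected_diameter(spread))
```

Under this reading, the score captures how dispersed a dataset is, which is one way a single scalar could summarise dataset quality.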
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.