What is the Value of Data? On Mathematical Methods for Data Quality Estimation
- URL: http://arxiv.org/abs/2001.03464v2
- Date: Tue, 12 May 2020 14:53:08 GMT
- Title: What is the Value of Data? On Mathematical Methods for Data Quality Estimation
- Authors: Netanel Raviv, Siddharth Jain, Jehoshua Bruck
- Abstract summary: We propose a formal definition for the quality of a given dataset.
We assess a dataset's quality by a quantity we call the expected diameter.
- Score: 35.75162309592681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data is one of the most important assets of the information age, and its
societal impact is undisputed. Yet, rigorous methods of assessing the quality
of data are lacking. In this paper, we propose a formal definition for the
quality of a given dataset. We assess a dataset's quality by a quantity we call
the expected diameter, which measures the expected disagreement between two
randomly chosen hypotheses that explain it, and has recently found applications
in active learning. We focus on Boolean hyperplanes, and utilize a collection
of Fourier analytic, algebraic, and probabilistic methods to come up with
theoretical guarantees and practical solutions for the computation of the
expected diameter. We also study the behaviour of the expected diameter on
algebraically structured datasets, conduct experiments that validate this
notion of quality, and demonstrate the feasibility of our techniques.
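To make the definition concrete, the following is a minimal sketch of how the expected diameter could be computed by brute force for a toy hypothesis class. The class used here, linear threshold functions over {0,1}^n with weights in {-1, 0, 1} and integer thresholds, is an illustrative stand-in for the Boolean hyperplanes studied in the paper, and the helper names (`predict`, `expected_diameter`) are hypothetical, not the authors' implementation.

```python
import itertools
import numpy as np

def predict(w, theta, X):
    # Linear threshold hypothesis: h(x) = 1 iff <w, x> >= theta.
    return (X @ w >= theta).astype(int)

def expected_diameter(X_labeled, y, n):
    """Mean pairwise disagreement, over all of {0,1}^n, between
    hypotheses consistent with the labeled data (X_labeled, y)."""
    domain = np.array(list(itertools.product([0, 1], repeat=n)))
    preds = []
    # Enumerate a toy hypothesis class: weights in {-1,0,1}^n,
    # integer thresholds in [-n, n].
    for w in itertools.product([-1, 0, 1], repeat=n):
        w = np.asarray(w)
        for theta in range(-n, n + 1):
            if np.array_equal(predict(w, theta, X_labeled), y):
                preds.append(predict(w, theta, domain))
    H = np.array(preds)  # |V| x 2^n table of predictions
    if len(H) < 2:
        return 0.0
    # E_{h1,h2} Pr_x[h1(x) != h2(x)] over independent uniform draws
    # from the version space (pairs with replacement; identical
    # pairs contribute zero disagreement).
    diffs = (H[:, None, :] != H[None, :, :]).mean(axis=2)
    return diffs.mean()

# Usage: a dataset that constrains the consistent hypotheses more
# tightly should yield a smaller expected diameter.
X = np.array([[0, 0, 0], [1, 1, 1]])
y = np.array([0, 1])
print(expected_diameter(X, y, n=3))
```

Under this definition, a smaller expected diameter means the dataset pins down its explaining hypothesis more tightly, i.e., the dataset is of higher quality.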
Related papers
- A Novel Metric for Measuring Data Quality in Classification Applications (extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
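One plausible reading of this criterion (an assumption on our part, not the paper's exact protocol) is to deteriorate the data in controlled steps and track how classification performance decays. The sketch below flips a growing fraction of training labels and records test accuracy; `degradation_curve` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def degradation_curve(X, y, fractions, seed=0):
    """Test accuracy as an increasing fraction of training labels
    is flipped -- one possible form of data deterioration."""
    rng = np.random.default_rng(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    scores = []
    for f in fractions:
        y_noisy = y_tr.copy()
        flip = rng.random(len(y_noisy)) < f   # corrupt fraction f of labels
        y_noisy[flip] = 1 - y_noisy[flip]
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
        scores.append(clf.score(X_te, y_te))
    return np.array(scores)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# A clean, informative dataset should show a clear monotone decay.
print(degradation_curve(X, y, fractions=[0.0, 0.1, 0.2, 0.3, 0.4]))
```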
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
- Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
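As a rough illustration of the flavor of this proxy, and not LAVA's actual hierarchical optimal-transport construction, the sketch below averages per-feature 1-D Wasserstein distances between training and validation examples within each class; `classwise_distance` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def classwise_distance(X_tr, y_tr, X_val, y_val):
    """Average per-feature 1-D Wasserstein distance between training
    and validation examples, computed separately within each class."""
    classes = np.intersect1d(np.unique(y_tr), np.unique(y_val))
    total = 0.0
    for c in classes:
        A, B = X_tr[y_tr == c], X_val[y_val == c]
        # Average the per-feature 1-D transport costs for this class.
        total += np.mean([wasserstein_distance(A[:, j], B[:, j])
                          for j in range(X_tr.shape[1])])
    return total / len(classes)

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_val, y_val = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
# Smaller distance suggests the training set better matches the
# validation distribution, class by class.
print(classwise_distance(X_tr, y_tr, X_val, y_val))
```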
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information-theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
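For concreteness, here is a minimal sketch of the simplest such strategy, random oversampling: minority-class examples are duplicated with replacement until every class matches the majority count. The helper `random_oversample` is illustrative, not taken from the paper.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples (with replacement) until
    every class reaches the majority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        idx.append(members)
        if n < target:  # top up minority classes by resampling
            idx.append(rng.choice(members, size=target - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)            # 8-vs-2 imbalance
Xb, yb = random_oversample(X, y)
print(np.unique(yb, return_counts=True))   # balanced: 8 and 8
```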
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring [8.864453148536061]
The study provides an automated platform that takes an incoming dataset and its metadata and produces a data quality (DQ) score, report, and label.
These results would be useful to data scientists, as the quality label instills confidence in the data before it is deployed in a practical application.
arXiv Detail & Related papers (2021-08-16T12:20:57Z)
- Data Quality Evaluation using Probability Models [0.0]
It is shown that, for the data examined, data quality can be accurately predicted from simple good/bad pre-labelled training examples.
arXiv Detail & Related papers (2020-09-14T18:12:19Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.