Data Representativity for Machine Learning and AI Systems
- URL: http://arxiv.org/abs/2203.04706v1
- Date: Wed, 9 Mar 2022 13:34:52 GMT
- Title: Data Representativity for Machine Learning and AI Systems
- Authors: Line H. Clemmensen, Rune D. Kjærsgaard
- Abstract summary: Data representativity is crucial when drawing inference from data through machine learning models.
This paper analyzes data representativity in scientific literature related to AI and sampling.
- Score: 2.588973722689844
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data representativity is crucial when drawing inference from data through
machine learning models. Scholars have increased focus on unraveling the bias
and fairness in the models, also in relation to inherent biases in the input
data. However, limited work exists on the representativity of samples
(datasets) for appropriate inference in AI systems. This paper analyzes data
representativity in scientific literature related to AI and sampling, and gives
a brief overview of statistical sampling methodology from disciplines like
sampling of physical materials, experimental design, survey analysis, and
observational studies. Different notions of a 'representative sample' exist in
past and present literature. In particular, the contrast between the notion of
a representative sample in the sense of coverage of the input space, versus a
representative sample as a miniature of the target population is of relevance
when building AI systems. Through empirical demonstrations on US Census data,
we show that the former is useful for providing equality and demographic
parity and is more robust to distribution shifts, whereas the latter is
useful when the purpose is to make historical inference, to draw inference
about the underlying population in general, or to make better predictions for
the majority in the underlying population. We propose a
framework of questions for creating and documenting data, with data
representativity in mind, as an addition to existing datasheets for datasets.
Finally, we also call for caution against the implicit, as well as explicit,
use of a notion of data representativity without specific clarification.
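To make the contrast between the two notions concrete, here is a minimal sketch in Python. The toy census-style data, group proportions, and sampling helpers are invented for illustration (this is not the paper's code), and input-space coverage is simplified to equal allocation per group:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy census-style population: a skewed group variable (90% majority,
# 10% minority) and one numeric feature. All values are invented.
population = pd.DataFrame({
    "group": rng.choice(["majority", "minority"], size=100_000, p=[0.9, 0.1]),
    "income": rng.lognormal(mean=10.0, sigma=0.5, size=100_000),
})

def miniature_sample(df, n, by="group", seed=0):
    """Representative-as-miniature: stratified sampling that preserves the
    population's group proportions (suited to historical inference and
    predictions for the majority)."""
    return (df.groupby(by, group_keys=False)
              .apply(lambda g: g.sample(frac=n / len(df), random_state=seed)))

def coverage_sample(df, n, by="group", seed=0):
    """Representative-as-coverage (simplified to equal group allocation):
    sample the same number of records per group (suited to demographic
    parity and more robust to distribution shifts)."""
    k = n // df[by].nunique()
    return (df.groupby(by, group_keys=False)
              .apply(lambda g: g.sample(n=min(k, len(g)), random_state=seed)))

print(miniature_sample(population, 1000)["group"].value_counts())  # ~900 / ~100
print(coverage_sample(population, 1000)["group"].value_counts())   # 500 / 500
```

The miniature sample reproduces the population's majority/minority split, while the coverage sample allocates equally across groups; which is appropriate depends on the inference goal, as the abstract argues.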
Related papers
- Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information [50.29934517930506]
DAFair is a novel approach to address social bias in language models.
We leverage prototypical demographic texts and incorporate a regularization term during the fine-tuning process to mitigate bias.
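A hedged sketch of the general idea (not the authors' implementation; the cosine-similarity scoring and KL-to-uniform penalty are assumptions chosen for illustration):

```python
import torch
import torch.nn.functional as F

def demographic_regularizer(h, prototypes):
    """h: (batch, d) encoder representations; prototypes: (groups, d)
    embeddings of prototypical demographic texts. Penalize deviation from
    a uniform similarity distribution over the groups, so that no
    demographic group is privileged by the representation."""
    sims = F.cosine_similarity(h.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    p = F.softmax(sims, dim=-1)                        # (batch, groups)
    uniform = torch.full_like(p, 1.0 / p.size(-1))
    return F.kl_div(p.log(), uniform, reduction="batchmean")

# During fine-tuning (lam is a hyperparameter):
# loss = task_loss + lam * demographic_regularizer(h, prototypes)
```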
arXiv Detail & Related papers (2024-03-14T15:58:36Z)
- Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition [4.336779198334903]
One of the most prominent types of demographic bias is the statistical imbalance in the representation of demographic groups in datasets.
We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics.
The paper provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models.
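For instance, two of the simplest imbalance measures such a taxonomy might cover can be computed directly from the group labels (an illustrative sketch, not the paper's metric definitions):

```python
import numpy as np
import pandas as pd

def representation_metrics(labels):
    """Two simple statistical-imbalance metrics over demographic labels."""
    p = pd.Series(labels).value_counts(normalize=True).to_numpy()
    k = len(p)
    # Representation rate: least- vs. most-represented group (1 = balanced).
    rep_rate = p.min() / p.max()
    # Normalized Shannon evenness: entropy relative to a uniform dataset.
    evenness = -(p * np.log(p)).sum() / np.log(k) if k > 1 else 1.0
    return {"representation_rate": rep_rate, "shannon_evenness": evenness}

print(representation_metrics(["f", "f", "f", "m"]))
# {'representation_rate': 0.333..., 'shannon_evenness': 0.811...}
```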
arXiv Detail & Related papers (2023-03-28T11:04:18Z)
- The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning [32.15608637930748]
We show that there is a trade-off between the two desiderata, so one may not be able to achieve both simultaneously.
We provide an analysis using a theoretical data model and show that, while more diverse pre-training data result in more diverse features for different tasks, they put less emphasis on task-specific features.
arXiv Detail & Related papers (2023-02-28T22:14:33Z)
- Deep Learning on a Healthy Data Diet: Finding Important Examples for Fairness [15.210232622716129]
Data-driven predictive solutions, predominant in commercial applications, tend to suffer from biases and stereotypes.
Data augmentation reduces gender bias by adding counterfactual examples to the training dataset.
We show that some of the examples in the augmented dataset can be unimportant or even harmful to fairness.
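A minimal sketch of the kind of counterfactual augmentation being pruned here (the word list and token-level swap rule are illustrative simplifications, not the paper's pipeline; possessive forms like "her" are genuinely ambiguous and handled crudely):

```python
# Gendered word pairs to swap; deliberately tiny and simplistic.
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "his": "her", "man": "woman", "woman": "man"}

def counterfactual(text):
    """Swap gendered tokens to create a counterfactual training example."""
    return " ".join(SWAPS.get(tok, tok) for tok in text.lower().split())

train = ["he is a doctor", "she stayed home"]
augmented = train + [counterfactual(t) for t in train]
print(augmented)
# ['he is a doctor', 'she stayed home', 'she is a doctor', 'he stayed home']
```

The paper's point is that not every such augmented example helps: some can be pruned without hurting, or even while improving, fairness.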
arXiv Detail & Related papers (2022-11-20T22:42:30Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
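As a toy stand-in for that simulation step (D-BIAS learns the causal network from data and uses its own simulation method; the linear structural equations and coefficients below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear causal model: experience -> salary, plus a biased
# gender -> salary edge. All coefficients are made up.
n = 10_000
gender = rng.integers(0, 2, n)          # 0/1 protected attribute
experience = rng.normal(5.0, 2.0, n)
salary = 3.0 * experience + 8.0 * gender + rng.normal(0.0, 1.0, n)

def simulate_debiased(gender, experience):
    """Deleting the gender -> salary edge: regenerate salary from the
    structural equation with that edge's coefficient set to zero."""
    return 3.0 * experience + 0.0 * gender + rng.normal(0.0, 1.0, len(gender))

debiased = simulate_debiased(gender, experience)
print(salary[gender == 1].mean() - salary[gender == 0].mean())      # ~8.0
print(debiased[gender == 1].mean() - debiased[gender == 0].mean())  # ~0.0
```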
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Representation Bias in Data: A Survey on Identification and Resolution Techniques [26.142021257838564]
Data-driven algorithms are only as good as the data they work with, yet data sets, especially social data, often fail to represent minorities adequately.
Representation bias in data can arise for various reasons, ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods.
This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how it is consumed later.
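One of the simplest resolution techniques in this space is inverse-frequency reweighting, sketched below (an illustrative example, not the survey's full catalogue of methods):

```python
import pandas as pd

def inverse_frequency_weights(groups):
    """Weight each record inversely to its group's frequency, so that
    under-represented groups count equally in downstream training."""
    s = pd.Series(groups)
    freq = s.value_counts(normalize=True)
    w = 1.0 / s.map(freq)
    return w / w.mean()            # normalize so weights average to 1

weights = inverse_frequency_weights(["a", "a", "a", "b"])
print(weights.tolist())            # [0.667, 0.667, 0.667, 2.0]
```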
arXiv Detail & Related papers (2022-03-22T16:30:22Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- A survey on datasets for fairness-aware machine learning [6.962333053044713]
A large variety of fairness-aware machine learning solutions have been proposed.
In this paper, we overview real-world datasets used for fairness-aware machine learning.
For a deeper understanding of bias and fairness in the datasets, we investigate relationships of interest using exploratory analysis.
arXiv Detail & Related papers (2021-10-01T16:54:04Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality, and efficiency of the designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.