Representation Bias in Data: A Survey on Identification and Resolution
Techniques
- URL: http://arxiv.org/abs/2203.11852v2
- Date: Sat, 18 Mar 2023 18:04:02 GMT
- Title: Representation Bias in Data: A Survey on Identification and Resolution
Techniques
- Authors: Nima Shahbazi, Yin Lin, Abolfazl Asudeh, H. V. Jagadish
- Abstract summary: Data-driven algorithms are only as good as the data they work with, while data sets, especially social data, often fail to represent minorities adequately.
Representation Bias in data can happen due to various reasons ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods.
This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how it is consumed later.
- Score: 26.142021257838564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-driven algorithms are only as good as the data they work with, while
data sets, especially social data, often fail to represent minorities
adequately. Representation Bias in data can happen due to various reasons
ranging from historical discrimination to selection and sampling biases in the
data acquisition and preparation methods. Given that "bias in, bias out", one
cannot expect AI-based solutions to have equitable outcomes for societal
applications, without addressing issues such as representation bias. While
there has been extensive study of fairness in machine learning models,
including several review papers, bias in the data has been less studied. This
paper reviews the literature on identifying and resolving representation bias
as a feature of a data set, independent of how it is consumed later. The scope of
this survey is bounded to structured (tabular) and unstructured (e.g., image,
text, graph) data. It presents taxonomies to categorize the studied techniques
based on multiple design dimensions and provides a side-by-side comparison of
their properties. There is still a long way to go before representation bias
issues in data are fully addressed. The authors hope that this survey motivates
researchers to approach these challenges in the future by building on existing
work within their respective domains.
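The survey treats representation bias as a measurable property of the data set itself. As a minimal illustrative sketch (not a method from the paper), one way to quantify it is to compare each group's share of the data against its share of a reference population; the function and variable names below are hypothetical.

```python
from collections import Counter

def representation_rates(groups, reference=None):
    """Fraction of records per demographic group. If reference
    population shares are given, return disparity ratios instead:
    a ratio below 1.0 means the group is under-represented."""
    counts = Counter(groups)
    n = len(groups)
    rates = {g: c / n for g, c in counts.items()}
    if reference is None:
        return rates
    return {g: rates.get(g, 0.0) / share for g, share in reference.items()}

data = ["a", "a", "a", "b"]             # observed group labels
pop = {"a": 0.5, "b": 0.5}              # hypothetical population shares
print(representation_rates(data, pop))  # {'a': 1.5, 'b': 0.5}
```

Group "b" here has a disparity ratio of 0.5, i.e., it appears at half its population rate, which is the kind of signal the surveyed identification techniques formalize.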
Related papers
- DSAP: Analyzing Bias Through Demographic Comparison of Datasets [4.8741052091630985]
We propose DSAP (Demographic Similarity from Auxiliary Profiles), a two-step methodology for comparing the demographic composition of two datasets.
DSAP can be deployed in three key applications: to detect and characterize demographic blind spots and bias issues across datasets, to measure dataset demographic bias in single datasets, and to measure dataset demographic shift in deployment scenarios.
An essential feature of DSAP is its ability to robustly analyze datasets without explicit demographic labels, offering simplicity and interpretability for a wide range of situations.
arXiv Detail & Related papers (2023-12-22T11:51:20Z) - Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and
Beyond [93.96982273042296]
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
We have identified two dataset biases that models can exploit as shortcuts to resolve various VL tasks correctly without proper understanding.
We propose Adversarial Data Synthesis (ADS) to generate synthetic training and debiased evaluation data.
We then introduce Intra-sample Counterfactual Training (ICT) to assist models in utilizing the synthesized training data, particularly the counterfactual data, via focusing on intra-sample differentiation.
arXiv Detail & Related papers (2023-10-23T08:09:42Z) - Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition [4.336779198334903]
One of the most prominent types of demographic bias is statistical imbalance in the representation of demographic groups in datasets.
We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics.
The paper provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models.
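The entry above concerns metrics for statistical imbalance between demographic groups. As one hedged example of this family of metrics (not necessarily one from that paper), normalized Shannon entropy scores how evenly groups are represented:

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Evenness of group representation: 1.0 means perfectly
    balanced groups; values approach 0 as one group dominates."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k < 2:
        return 0.0  # a single group carries no balance information
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(k)  # divide by max entropy to get [0, 1]

print(normalized_entropy(["f", "m", "f", "m"]))  # 1.0
```

Metrics like this summarize a dataset's demographic composition in a single number, which is what such taxonomies help practitioners choose among.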
arXiv Detail & Related papers (2023-03-28T11:04:18Z) - D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling
Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, such as weakening or deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z) - Assessing Demographic Bias Transfer from Dataset to Model: A Case Study
in Facial Expression Recognition [1.5340540198612824]
Two metrics focus on the representational and stereotypical bias of the dataset, and the third on the residual bias of the trained model.
We demonstrate the usefulness of the metrics by applying them to a FER problem based on the popular AffectNet dataset.
arXiv Detail & Related papers (2022-05-20T09:40:42Z) - Data Representativity for Machine Learning and AI Systems [2.588973722689844]
Data representativity is crucial when drawing inference from data through machine learning models.
This paper analyzes data representativity in scientific literature related to AI and sampling.
arXiv Detail & Related papers (2022-03-09T13:34:52Z) - Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
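The entry above describes countering bias via instance reweighting. A minimal sketch of one common variant, inverse-frequency weighting over (demographic, label) cells, is shown below; this is an illustrative assumption about the general technique, not the paper's exact method, and all names are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(groups, labels):
    """Weight each instance inversely to the frequency of its
    (group, label) cell so rare combinations count more in training."""
    cells = list(zip(groups, labels))
    counts = Counter(cells)
    n = len(cells)
    k = len(counts)
    # scale so each cell contributes equally and weights sum to n
    return [n / (k * counts[c]) for c in cells]

g = ["f", "f", "f", "m"]  # author demographic per instance
y = [1, 1, 1, 0]          # task label per instance
print(inverse_frequency_weights(g, y))
# -> [0.6666666666666666, 0.6666666666666666, 0.6666666666666666, 2.0]
```

Such weights can typically be passed to a learner's `sample_weight`-style argument so that the loss no longer under-counts rare group-label combinations.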
arXiv Detail & Related papers (2021-09-16T23:40:28Z) - A Survey on Bias in Visual Datasets [17.79365832663837]
Computer Vision (CV) has achieved remarkable results, outperforming humans in several tasks.
CV systems highly depend on the data they are fed with and can learn and amplify biases within such data.
Yet, to date there is no comprehensive survey on bias in visual datasets.
arXiv Detail & Related papers (2021-07-16T14:16:52Z) - On the Efficacy of Adversarial Data Collection for Question Answering:
Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset.
It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.