On the Origins of Sampling Bias: Implications on Fairness Measurement and Mitigation
- URL: http://arxiv.org/abs/2503.17956v1
- Date: Sun, 23 Mar 2025 06:23:07 GMT
- Title: On the Origins of Sampling Bias: Implications on Fairness Measurement and Mitigation
- Authors: Sami Zhioua, Ruta Binkyte, Ayoub Ouni, Farah Barika Ktata,
- Abstract summary: Several sources of bias exist and it is assumed that bias resulting from machine learning is born equally by different groups.<n> Sampling bias, in particular, is inconsistently used in the literature to describe bias due to the sampling procedure.<n>We introduce clearly defined variants of sampling bias, namely, sample size bias ( SSB) and underrepresentation bias (URB)
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Accurately measuring discrimination is crucial to faithfully assessing fairness of trained machine learning (ML) models. Any bias in measuring discrimination leads to either amplification or underestimation of the existing disparity. Several sources of bias exist and it is assumed that bias resulting from machine learning is born equally by different groups (e.g. females vs males, whites vs blacks, etc.). If, however, bias is born differently by different groups, it may exacerbate discrimination against specific sub-populations. Sampling bias, in particular, is inconsistently used in the literature to describe bias due to the sampling procedure. In this paper, we attempt to disambiguate this term by introducing clearly defined variants of sampling bias, namely, sample size bias (SSB) and underrepresentation bias (URB). Through an extensive set of experiments on benchmark datasets and using mainstream learning algorithms, we expose relevant observations in several model training scenarios. The observations are finally framed as actionable recommendations for practitioners.
Related papers
- Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection [5.800102484016876]
Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content.<n>This paper presents a systematic framework grounded in social psychology theories to investigate explicit and implicit biases in LLMs.
arXiv Detail & Related papers (2025-01-04T14:08:52Z) - How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and their manifestation in LLMs.<n>Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z) - Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks [3.973239756262797]
This study examines such biases in open-generation benchmarks like BOLD and SAGED.
Results reveal unequal treatment of demographic descriptors, calling for more robust bias metric models.
arXiv Detail & Related papers (2024-10-14T20:08:40Z) - Revisiting the Dataset Bias Problem from a Statistical Perspective [72.94990819287551]
We study the "dataset bias" problem from a statistical standpoint.
We identify the main cause of the problem as the strong correlation between a class attribute u and a non-class attribute b.
We propose to mitigate dataset bias via either weighting the objective of each sample n by frac1p(u_n|b_n) or sampling that sample with a weight proportional to frac1p(u_n|b_n).
arXiv Detail & Related papers (2024-02-05T22:58:06Z) - Dissecting Causal Biases [0.0]
This paper focuses on a class of bias originating in the way training data is generated and/or collected.
Four sources of bias are considered, namely, confounding, selection, measurement, and interaction.
arXiv Detail & Related papers (2023-10-20T09:12:10Z) - Shedding light on underrepresentation and Sampling Bias in machine
learning [0.0]
We show how discrimination can be decomposed into variance, bias, and noise.
We challenge the commonly accepted mitigation approach that discrimination can be addressed by collecting more samples of the underrepresented group.
arXiv Detail & Related papers (2023-06-08T09:34:20Z) - On Comparing Fair Classifiers under Data Bias [42.43344286660331]
We study the effect of varying data biases on the accuracy and fairness of fair classifiers.
Our experiments show how to integrate a measure of data bias risk in the existing fairness dashboards for real-world deployments.
arXiv Detail & Related papers (2023-02-12T13:04:46Z) - Causally Testing Gender Bias in LLMs: A Case Study on Occupational Bias [33.99768156365231]
We introduce a causal formulation for bias measurement in generative language models.
We propose a benchmark called OccuGender, with a bias-measuring procedure to investigate occupational gender bias.
The results show that these models exhibit substantial occupational gender bias.
arXiv Detail & Related papers (2022-12-20T22:41:24Z) - BLIND: Bias Removal With No Demographics [29.16221451643288]
We introduce BLIND, a method for bias removal with no prior knowledge of the demographics in the dataset.
While training a model on a downstream task, BLIND detects biased samples using an auxiliary model that predicts the main model's success, and down-weights those samples during the training process.
Experiments with racial and gender biases in sentiment classification and occupation classification tasks demonstrate that BLIND mitigates social biases without relying on a costly demographic annotation process.
arXiv Detail & Related papers (2022-12-20T18:59:42Z) - The SAME score: Improved cosine based bias score for word embeddings [49.75878234192369]
We introduce SAME, a novel bias score for semantic bias in embeddings.
We show that SAME is capable of measuring semantic bias and identify potential causes for social bias in downstream tasks.
arXiv Detail & Related papers (2022-03-28T09:28:13Z) - Fighting Fire with Fire: Contrastive Debiasing without Bias-free Data
via Generative Bias-transformation [31.944147533327058]
Contrastive Debiasing via Generative Bias-transformation (CDvG)
We propose a novel method, Contrastive Debiasing via Generative Bias-transformation (CDvG), which works without explicit bias labels or bias-free samples.
Our method demonstrates superior performance compared to prior approaches, especially when bias-free samples are scarce or absent.
arXiv Detail & Related papers (2021-12-02T07:16:06Z) - Fairness-aware Class Imbalanced Learning [57.45784950421179]
We evaluate long-tail learning methods for tweet sentiment and occupation classification.
We extend a margin-loss based approach with methods to enforce fairness.
arXiv Detail & Related papers (2021-09-21T22:16:30Z) - Balancing out Bias: Achieving Fairness Through Training Reweighting [58.201275105195485]
Bias in natural language processing arises from models learning characteristics of the author such as gender and race.
Existing methods for mitigating and measuring bias do not directly account for correlations between author demographics and linguistic variables.
This paper introduces a very simple but highly effective method for countering bias using instance reweighting.
arXiv Detail & Related papers (2021-09-16T23:40:28Z) - LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.