Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models
- URL: http://arxiv.org/abs/2509.16462v1
- Date: Fri, 19 Sep 2025 22:59:55 GMT
- Title: Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models
- Authors: Mina Arzaghi, Alireza Dehghanpour Farashah, Florian Carichon, Golnoosh Farnadi
- Abstract summary: Large Language Models (LLMs) exhibit socio-economic biases that can propagate into downstream tasks. We present a unified evaluation framework to compare intrinsic bias mitigation via concept unlearning with extrinsic bias mitigation via counterfactual data augmentation. Our results show that intrinsic bias mitigation through unlearning reduces intrinsic gender bias by up to 94.9%, while also improving downstream task fairness metrics, such as demographic parity by up to 82%, without compromising accuracy.
- Score: 11.396244643030983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) exhibit socio-economic biases that can propagate into downstream tasks. While prior studies have questioned whether intrinsic bias in LLMs affects fairness at the downstream task level, this work empirically investigates the connection. We present a unified evaluation framework to compare intrinsic bias mitigation via concept unlearning with extrinsic bias mitigation via counterfactual data augmentation (CDA). We examine this relationship through real-world financial classification tasks, including salary prediction, employment status, and creditworthiness assessment. Using three open-source LLMs, we evaluate models both as frozen embedding extractors and as fine-tuned classifiers. Our results show that intrinsic bias mitigation through unlearning reduces intrinsic gender bias by up to 94.9%, while also improving downstream task fairness metrics, such as demographic parity by up to 82%, without compromising accuracy. Our framework offers practical guidance on where mitigation efforts can be most effective and highlights the importance of applying early-stage mitigation before downstream deployment.
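Two of the techniques named in the abstract are concrete enough to illustrate. Below is a minimal, self-contained sketch (not the authors' released code) of extrinsic mitigation via counterfactual data augmentation, i.e. gender-term swapping, together with the demographic parity difference used as a downstream fairness metric; the swap list, toy texts, and toy predictions are all illustrative assumptions.

```python
# Illustrative sketch only: a crude CDA pass plus the demographic parity
# difference metric. The swap dictionary and toy data are assumptions,
# not the paper's actual pipeline.

GENDER_SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
                "his": "her", "man": "woman", "woman": "man"}

def counterfactual_augment(texts):
    """Return the originals plus a gender-swapped copy of each text (CDA).
    Lowercases and swaps whole tokens only; real CDA is more careful."""
    augmented = list(texts)
    for text in texts:
        swapped = " ".join(GENDER_SWAPS.get(tok, tok)
                           for tok in text.lower().split())
        augmented.append(swapped)
    return augmented

def demographic_parity_difference(y_pred, group):
    """|P(y_hat=1 | group=0) - P(y_hat=1 | group=1)| for binary predictions."""
    rates = []
    for g in (0, 1):
        preds = [p for p, gi in zip(y_pred, group) if gi == g]
        rates.append(sum(preds) / len(preds))
    return abs(rates[0] - rates[1])

if __name__ == "__main__":
    print(counterfactual_augment(["he is an engineer", "she is a nurse"]))
    # Hypothetical salary predictions (1 = high salary) for two groups.
    y_pred = [1, 1, 0, 1, 0, 0]
    group  = [0, 0, 0, 1, 1, 1]  # toy protected-attribute encoding
    print(demographic_parity_difference(y_pred, group))  # |2/3 - 1/3| ~ 0.33
```

Perfect parity gives a difference of 0; mitigation methods such as the unlearning and CDA compared in the paper aim to push this gap toward 0 without hurting task accuracy.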
Related papers
- Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.86300466350013]
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities. We present a benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets.
arXiv Detail & Related papers (2025-04-14T07:14:27Z) - Fairness Mediator: Neutralize Stereotype Associations to Mitigate Bias in Large Language Models [66.5536396328527]
LLMs inadvertently absorb spurious correlations from training data, leading to stereotype associations between biased concepts and specific social groups. We propose Fairness Mediator (FairMed), a bias mitigation framework that neutralizes stereotype associations. Our framework comprises two main components: a stereotype association prober and an adversarial debiasing neutralizer (a generic sketch of adversarial debiasing appears after this list).
arXiv Detail & Related papers (2025-04-10T14:23:06Z) - Evaluating Bias in LLMs for Job-Resume Matching: Gender, Race, and Education [8.235367170516769]
Large Language Models (LLMs) offer the potential to automate hiring by matching job descriptions with candidate resumes. However, biases inherent in these models may lead to unfair hiring practices, reinforcing societal prejudices and undermining workplace diversity. This study examines the performance and fairness of LLMs in job-resume matching tasks within the English language and U.S. context.
arXiv Detail & Related papers (2025-03-24T22:11:22Z) - How far can bias go? -- Tracing bias from pretraining data to alignment [54.51310112013655]
This study examines the correlation between gender-occupation bias in pre-training data and its manifestation in LLMs. Our findings reveal that biases present in pre-training data are amplified in model outputs.
arXiv Detail & Related papers (2024-11-28T16:20:25Z) - With a Grain of SALT: Are LLMs Fair Across Social Dimensions? [3.5001789247699535]
This paper presents a systematic analysis of biases in open-source Large Language Models (LLMs) across gender, religion, and race. We use the SALT dataset, which incorporates five distinct bias triggers: General Debate, Positioned Debate, Career Advice, Problem Solving, and CV Generation. Our findings reveal consistent polarization across models, with certain demographic groups receiving systematically favorable or unfavorable treatment.
arXiv Detail & Related papers (2024-10-16T12:22:47Z) - Unboxing Occupational Bias: Grounded Debiasing of LLMs with U.S. Labor Data [9.90951705988724]
Large Language Models (LLMs) are prone to inheriting and amplifying societal biases.
LLM bias can have far-reaching consequences, leading to unfair practices and exacerbating social inequalities.
arXiv Detail & Related papers (2024-08-20T23:54:26Z) - Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Towards Understanding Task-agnostic Debiasing Through the Lenses of Intrinsic Bias and Forgetfulness [10.081447621656523]
The impact on language modeling ability can be alleviated given a high-quality and long-contextualized debiasing corpus.
The effectiveness of task-agnostic debiasing hinges on the quantitative bias level of both the task-specific data used for downstream applications and the debiased model.
We propose ProSocialTuning, a novel framework that can Propagate Socially-fair Debiasing to Downstream Fine-tuning.
arXiv Detail & Related papers (2024-06-06T15:11:11Z) - Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - On Transferability of Bias Mitigation Effects in Language Model
Fine-Tuning [30.833538367971872]
Fine-tuned language models have been shown to exhibit biases against protected groups in a host of modeling tasks.
Previous works focus on detecting these biases, reducing bias in data representations, and using auxiliary training objectives to mitigate bias during fine-tuning.
We explore the feasibility and benefits of upstream bias mitigation (UBM) for reducing bias on downstream tasks.
arXiv Detail & Related papers (2020-10-24T10:36:11Z)