Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment
- URL: http://arxiv.org/abs/2602.16438v1
- Date: Wed, 18 Feb 2026 13:19:11 GMT
- Title: Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment
- Authors: Eva Paraschou, Line Harder Clemmensen, Sneha Das
- Abstract summary: We investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art large language models (LLMs). Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty.
- Score: 3.1670140283390276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguated contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p < 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.
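The context-aware analysis described above can be illustrated with a short sketch. The snippet below follows the standard BBQ scoring convention (Parrish et al., 2022), splitting bias scores by attribute and by ambiguous versus disambiguated context; the record schema and field names are illustrative assumptions, not the authors' code.

```python
"""Minimal sketch of context-aware BBQ-style bias scoring.

Assumes per-example records with fields: attribute (e.g. "gender"),
context ("ambiguous" or "disambiguated"), prediction, biased_answer,
and label. Field names are hypothetical."""
from collections import defaultdict

def bbq_bias_scores(records):
    """Per-(attribute, context) bias scores in the BBQ convention:
    s_dis = 2 * (biased answers / non-UNKNOWN answers) - 1; in ambiguous
    contexts the score is additionally scaled by (1 - accuracy)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["attribute"], r["context"])].append(r)

    scores = {}
    for key, rs in groups.items():
        non_unknown = [r for r in rs if r["prediction"] != "UNKNOWN"]
        if not non_unknown:
            scores[key] = 0.0
            continue
        biased = sum(r["prediction"] == r["biased_answer"] for r in non_unknown)
        s = 2.0 * biased / len(non_unknown) - 1.0
        if key[1] == "ambiguous":
            acc = sum(r["prediction"] == r["label"] for r in rs) / len(rs)
            s *= (1.0 - acc)  # ambiguous score shrinks toward 0 as accuracy rises
        scores[key] = s
    return scores

# Spillover check: compare scores before vs. after targeted gender alignment.
# A score moving away from 0 on an untargeted attribute (e.g. physical
# appearance) in ambiguous contexts is the bias spillover the paper reports.
```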
Related papers
- RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation [67.38036090822982]
We propose RoboView-Bias, the first benchmark specifically designed to quantify visual bias in robotic manipulation. We create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.
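One simple way such a benchmark's trial logs could be aggregated into a per-factor bias measure is sketched below; the field names and the spread-based measure are illustrative assumptions, not RoboView-Bias's actual API.

```python
from collections import defaultdict

def visual_factor_bias(trials, factor):
    """Success rate per value of one visual factor (e.g. object color),
    plus the spread between best- and worst-performing values.
    `trials` is a list of dicts with a `success` flag (assumed schema)."""
    by_value = defaultdict(list)
    for t in trials:
        by_value[t[factor]].append(t["success"])
    rates = {v: sum(s) / len(s) for v, s in by_value.items()}
    return rates, max(rates.values()) - min(rates.values())
```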
arXiv Detail & Related papers (2025-09-26T13:53:25Z) - Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models [11.396244643030983]
Large Language Models (LLMs) exhibit socio-economic biases that can propagate into downstream tasks. We present a unified evaluation framework to compare intrinsic bias mitigation via concept unlearning with extrinsic bias mitigation via counterfactual data augmentation. Our results show that intrinsic bias mitigation through unlearning reduces intrinsic gender bias by up to 94.9%, while also improving downstream task fairness metrics, such as demographic parity by up to 82%, without compromising accuracy.
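Demographic parity, the downstream metric named above, is straightforward to compute; a minimal sketch with illustrative variable names:

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Gap between the highest and lowest positive-prediction rates
    across groups; 0.0 means parity. Improvements like the reported
    "up to 82%" are reductions in gaps of this kind."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)
```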
arXiv Detail & Related papers (2025-09-19T22:59:55Z) - Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.86300466350013]
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities. We present a benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets.
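Judging-bias benchmarks commonly include consistency probes. As one illustrative example (not necessarily this paper's protocol), position bias can be measured by swapping answer order; `judge` below is a hypothetical callable wrapping the model under test.

```python
def position_bias_rate(judge, pairs):
    """Fraction of answer pairs where the judge's preference flips when
    the two candidates are swapped. `judge(question, first, second)`
    returns "A" or "B" for the first or second presented answer."""
    flips = 0
    for question, ans_a, ans_b in pairs:
        first = judge(question, ans_a, ans_b)
        second = judge(question, ans_b, ans_a)  # same contents, swapped order
        # Consistent judging picks the same text both times, i.e. the two
        # verdicts differ ("A" then "B", or "B" then "A").
        if first == second:
            flips += 1
    return flips / len(pairs)
```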
arXiv Detail & Related papers (2025-04-14T07:14:27Z) - Metamorphic Testing for Fairness Evaluation in Large Language Models: Identifying Intersectional Bias in LLaMA and GPT [2.380039717474099]
Large Language Models (LLMs) have made significant strides in Natural Language Processing but remain vulnerable to fairness-related issues. This paper introduces a metamorphic testing approach to systematically identify fairness bugs in LLMs.
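The core idea of metamorphic testing for fairness is that swapping a demographic term in an otherwise identical prompt should not change the model's decision. A minimal sketch, with `model`, the template, and the term pairs as illustrative placeholders:

```python
DEMOGRAPHIC_SWAPS = [("he", "she"), ("Christian", "Muslim"), ("young", "elderly")]

def metamorphic_fairness_bugs(model, prompt_template, swaps=DEMOGRAPHIC_SWAPS):
    """Return the swaps for which the metamorphic relation
    (equal outputs under a demographic swap) is violated."""
    failures = []
    for a, b in swaps:
        out_a = model(prompt_template.format(person=a))
        out_b = model(prompt_template.format(person=b))
        if out_a != out_b:  # violated relation: candidate fairness bug
            failures.append((a, b, out_a, out_b))
    return failures
```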
arXiv Detail & Related papers (2025-04-04T21:04:14Z) - Towards counterfactual fairness through auxiliary variables [11.756940915048713]
We introduce EXOgenous Causal reasoning (EXOC), a novel causal reasoning framework motivated by exogenous variables. Our framework explicitly defines an auxiliary node and a control node that contribute to counterfactual fairness. Our evaluation, conducted on synthetic and real-world datasets, validates EXOC's superiority.
arXiv Detail & Related papers (2024-12-06T04:23:05Z) - The Fragility of Fairness: Causal Sensitivity Analysis for Fair Machine Learning [34.50562695587344]
We adapt tools from causal sensitivity analysis to the FairML context.
We analyze the sensitivity of the most common parity metrics under 3 varieties of classifier.
We show that causal sensitivity analysis provides a powerful and necessary toolkit for gauging the informativeness of parity metric evaluations.
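The paper's toolkit is causal sensitivity analysis; as a far simpler, non-causal illustration of why parity metrics can be fragile, the sketch below perturbs sensitive labels at random and reports the resulting range of the measured parity gap. This is a crude stand-in, not the authors' method.

```python
import numpy as np

def parity_gap(y_pred, group):
    """Absolute gap in positive-prediction rates between binary groups."""
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())

def parity_sensitivity(y_pred, group, eps=0.05, trials=1000, seed=0):
    """Range of the parity gap when up to an eps-fraction of binary
    group labels are flipped at random, a rough proxy for label noise."""
    rng = np.random.default_rng(seed)
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    gaps = []
    for _ in range(trials):
        flip = rng.random(len(group)) < eps
        gaps.append(parity_gap(y_pred, np.where(flip, 1 - group, group)))
    return min(gaps), max(gaps)
```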
arXiv Detail & Related papers (2024-10-12T17:28:49Z) - What Hides behind Unfairness? Exploring Dynamics Fairness in Reinforcement Learning [52.51430732904994]
In reinforcement learning problems, agents must consider long-term fairness while maximizing returns.
Recent works have proposed many different types of fairness notions, but how unfairness arises in RL problems remains unclear.
We introduce a novel notion called dynamics fairness, which explicitly captures the inequality stemming from environmental dynamics.
arXiv Detail & Related papers (2024-04-16T22:47:59Z) - Fairness Explainability using Optimal Transport with Applications in
Image Classification [0.46040036610482665]
We propose a comprehensive approach to uncover the causes of discrimination in Machine Learning applications.
We leverage Wasserstein barycenters to achieve fair predictions and introduce an extension to pinpoint bias-associated regions.
This allows us to derive a cohesive system which uses the enforced fairness to measure each feature's influence on the bias.
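In one dimension, the Wasserstein barycenter of the per-group score distributions has a closed form as the weighted average of the groups' quantile functions. The sketch below shows this generic "barycenter repair" of scores; it illustrates the construction only, not the paper's full explainability system.

```python
import numpy as np

def barycenter_repair(scores, group):
    """Map each score to the 1-D Wasserstein barycenter of the per-group
    score distributions (weighted average of group quantile functions)."""
    scores, group = np.asarray(scores, float), np.asarray(group)
    values = np.unique(group)
    weights = [np.mean(group == g) for g in values]
    repaired = np.empty_like(scores)
    for g in values:
        mask = group == g
        # Within-group quantile level of each score (ranks scaled to [0, 1]).
        p = np.argsort(np.argsort(scores[mask])) / max(mask.sum() - 1, 1)
        # Barycenter quantile: weighted average of all groups' quantiles at p.
        repaired[mask] = sum(
            w * np.quantile(scores[group == h], p)
            for w, h in zip(weights, values)
        )
    return repaired
```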
arXiv Detail & Related papers (2023-08-22T00:10:23Z) - Practical Approaches for Fair Learning with Multitype and Multivariate
Sensitive Attributes [70.6326967720747]
It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences.
We introduce FairCOCCO, a fairness measure built on cross-covariance operators on reproducing kernel Hilbert Spaces.
We empirically demonstrate consistent improvements against state-of-the-art techniques in balancing predictive power and fairness on real-world datasets.
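FairCOCCO is built on cross-covariance operators on reproducing kernel Hilbert spaces; a closely related and simpler kernel dependence measure is HSIC, sketched below to show the flavor of such measures. This is an illustration of kernel-based dependence, not FairCOCCO itself.

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    """Gaussian kernel matrix for samples of shape (n,) or (n, d)."""
    x = np.atleast_2d(np.asarray(x, float).T).T  # ensure shape (n, d)
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def hsic(y_pred, sensitive, sigma=1.0):
    """Biased HSIC estimate between predictions and sensitive attributes:
    trace(K H L H) / (n - 1)^2 with H the centering matrix. Values near 0
    indicate (kernel) independence, i.e. a fairer predictor."""
    n = len(y_pred)
    K = rbf_kernel(y_pred, sigma)
    L = rbf_kernel(sensitive, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```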
arXiv Detail & Related papers (2022-11-11T11:28:46Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
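The metric aggregates the model's sensitivity to small input perturbations. A minimal finite-difference sketch follows; the published metric weights features by their relevance to protected attributes, which is reduced to a uniform placeholder here.

```python
import numpy as np

def accumulated_sensitivity(predict_proba, X, delta=1e-3, weights=None):
    """Average over samples of the weighted per-feature change in the
    predicted probability under a small perturbation. `predict_proba`
    maps an (n, d) array to n positive-class probabilities; `weights`
    would encode protected-attribute relevance (uniform here)."""
    X = np.asarray(X, float)
    n, d = X.shape
    w = np.ones(d) / d if weights is None else np.asarray(weights, float)
    base = predict_proba(X)
    total = 0.0
    for j in range(d):
        Xp = X.copy()
        Xp[:, j] += delta
        total += w[j] * np.abs(predict_proba(Xp) - base).mean() / delta
    return total
```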
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - Fairness without the sensitive attribute via Causal Variational Autoencoder [17.675997789073907]
Due to privacy concerns and various regulations such as the RGPD in the EU, many personal sensitive attributes are frequently not collected.
By leveraging recent developments for approximate inference, we propose an approach to fill this gap.
Based on a causal graph, we rely on a new variational auto-encoding based framework named SRCVAE to infer a sensitive information proxy.
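The general idea of inferring a sensitive-attribute proxy with a variational autoencoder can be sketched compactly in PyTorch; the architecture and loss below are illustrative assumptions, not SRCVAE's actual design.

```python
import torch
import torch.nn as nn

class ProxyVAE(nn.Module):
    """Tiny VAE whose 1-D latent serves as a proxy for an unobserved
    sensitive attribute, inferred from observed features x."""
    def __init__(self, x_dim, h_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, 1), nn.Linear(h_dim, 1)
        self.dec = nn.Sequential(nn.Linear(1, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus KL divergence to a standard normal prior.
    rec = nn.functional.mse_loss(recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

# After training, `mu` is the inferred sensitive-attribute proxy that
# can be used to audit fairness downstream.
```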
arXiv Detail & Related papers (2021-09-10T17:12:52Z) - MultiFair: Multi-Group Fairness in Machine Learning [52.24956510371455]
We study multi-group fairness in machine learning (MultiFair).
We propose a generic end-to-end algorithmic framework to solve it.
Our proposed framework is generalizable to many different settings.
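One natural multi-group evaluation is the worst-case parity gap across every group of every sensitive attribute, which a multi-group method must keep small simultaneously. A minimal sketch of that evaluation (not MultiFair's actual objective):

```python
import numpy as np

def multi_attribute_parity_gaps(y_pred, groups):
    """Worst-case positive-rate gap per sensitive attribute.
    `groups` maps attribute name -> per-sample group labels."""
    y_pred = np.asarray(y_pred)
    gaps = {}
    for attr, labels in groups.items():
        labels = np.asarray(labels)
        rates = [y_pred[labels == g].mean() for g in np.unique(labels)]
        gaps[attr] = max(rates) - min(rates)
    return gaps
```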
arXiv Detail & Related papers (2021-05-24T02:30:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.