DeFrame: Debiasing Large Language Models Against Framing Effects
- URL: http://arxiv.org/abs/2602.04306v1
- Date: Wed, 04 Feb 2026 08:15:51 GMT
- Title: DeFrame: Debiasing Large Language Models Against Framing Effects
- Authors: Kahee Lim, Soyeon Kim, Steven Euijong Whang
- Abstract summary: As large language models (LLMs) are increasingly deployed in real-world applications, ensuring fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations but can produce biased responses outside those evaluation settings. We identify framing -- differences in how semantically equivalent prompts are expressed -- as an underexplored contributor to this gap.
- Score: 12.839436067299188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.
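The abstract describes framing disparity only at a high level. The sketch below shows one plausible way to compute a frame-averaged fairness score and a framing disparity over a benchmark whose items carry alternative framings of the same prompt. The helpers `query_model` and `fairness_score`, the item layout, and the choice of max-minus-min as the disparity measure are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a framing-disparity measurement (illustrative assumptions only:
# query_model and fairness_score are hypothetical placeholders, and disparity
# is taken as the gap between the best- and worst-scoring framing).
from statistics import mean

def query_model(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call (swap in an API or local model)."""
    return "A"  # dummy response for illustration

def fairness_score(responses: list[str]) -> float:
    """Hypothetical placeholder for a benchmark-specific fairness metric in [0, 1]."""
    neutral = sum(r.strip().lower() in {"unknown", "cannot answer"} for r in responses)
    return neutral / max(len(responses), 1)

def framing_disparity(items: list[dict]) -> tuple[float, float]:
    """Each item holds semantically equivalent framings of one prompt, e.g.
    {"framings": ["Is A better than B?", "Is B worse than A?"]}.
    Returns (frame-averaged fairness, framing disparity)."""
    n_frames = len(items[0]["framings"])
    frame_scores = []
    for f in range(n_frames):
        responses = [query_model(item["framings"][f]) for item in items]
        frame_scores.append(fairness_score(responses))
    overall = mean(frame_scores)                       # frame-averaged fairness
    disparity = max(frame_scores) - min(frame_scores)  # sensitivity to framing
    return overall, disparity

# Toy usage with one two-framing item.
items = [{"framings": ["Is A better than B at math?", "Is B worse than A at math?"]}]
print(framing_disparity(items))
```

Under this reading, the proposed framing-aware debiasing would aim to keep the disparity small as well as the overall score high, for example by training the model to answer consistently across the framings of each item.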
Related papers
- Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment [3.1670140283390276]
We investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art large language models (LLMs). Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty.
arXiv Detail & Related papers (2026-02-18T13:19:11Z) - HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment [52.374772443536045]
HALF (Harm-Aware LLM Fairness) is a framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. We show that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
arXiv Detail & Related papers (2025-10-14T07:13:26Z) - More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning [10.301985230669684]
We study the mechanisms by which semantic cues shape reasoning in large language models. We introduce MathComp, a benchmark of 300 comparison scenarios. We find that model errors frequently reflect linguistic steering: systematic shifts toward the comparative term present in the prompt.
arXiv Detail & Related papers (2025-06-04T13:15:01Z) - Relative Bias: A Comparative Framework for Quantifying Bias in LLMs [29.112649816695203]
Relative Bias is a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain. We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations in the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively. Applying our framework to several case studies on bias and alignment scenarios, followed by statistical tests for validation, we find strong agreement between the two scoring methods.
arXiv Detail & Related papers (2025-05-22T01:59:54Z) - Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation [14.521056434373213]
Large vision-language models (LVLMs) have emerged as the preferred tools for judging text-image alignment. This work is the first study to address a key research question: can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores?
arXiv Detail & Related papers (2025-05-21T08:24:28Z) - Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation [19.66750942418172]
Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases (a minimal Borda-count sketch appears after this list).
arXiv Detail & Related papers (2025-03-29T04:36:25Z) - Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models [16.977176752570617]
Large Language Models (LLMs) are increasingly powerful and accessible to human users. Ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. This work benchmarks the group fairness of learned reward models.
arXiv Detail & Related papers (2025-03-10T19:39:39Z) - Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z) - Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements conflict with standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z) - Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics. Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank [54.854714257687334]
We propose a novel approach, RankCSE, for unsupervised sentence representation learning.
It incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework.
An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks.
arXiv Detail & Related papers (2023-05-26T08:27:07Z)
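The "Ethical AI on the Waitlist" entry above mentions applying Borda scoring to LLM-produced rankings. Below is a minimal, self-contained sketch of group-level Borda counts; the candidate IDs, group labels, and aggregation by group mean are illustrative assumptions rather than that paper's evaluation code.

```python
# Minimal sketch of group-level Borda scoring for LLM-produced rankings
# (candidate/group structure is an assumption for illustration).
from collections import defaultdict

def borda_scores(rankings: list[list[str]]) -> dict[str, float]:
    """Each ranking lists candidate IDs from most to least preferred.
    A candidate ranked r-th among n receives n - 1 - r points (top gets n - 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for r, cand in enumerate(ranking):
            scores[cand] += n - 1 - r
    return dict(scores)

def group_mean_borda(scores: dict[str, float], groups: dict[str, str]) -> dict[str, float]:
    """Average Borda score per demographic group; large gaps suggest ranking bias."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cand, s in scores.items():
        g = groups.get(cand, "unknown")
        totals[g] += s
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}

# Toy usage: two model rankings over three candidates with group labels.
rankings = [["c1", "c2", "c3"], ["c2", "c1", "c3"]]
groups = {"c1": "group_a", "c2": "group_a", "c3": "group_b"}
print(group_mean_borda(borda_scores(rankings), groups))
```

Comparing mean Borda scores across groups gives a simple ranking-aware fairness signal: a large gap indicates that candidates from one group are systematically ranked lower.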