When Wording Steers the Evaluation: Framing Bias in LLM judges
- URL: http://arxiv.org/abs/2601.13537v1
- Date: Tue, 20 Jan 2026 02:48:10 GMT
- Title: When Wording Steers the Evaluation: Framing Bias in LLM judges
- Authors: Yerin Hwang, Dongryeol Lee, Taegwan Kang, Minwoo Lee, Kyomin Jung
- Abstract summary: Large language models (LLMs) are known to produce varying responses depending on prompt phrasing. We investigate how deliberate prompt framing skews model judgments across four high-stakes evaluation tasks. Across 14 judges, we observe clear susceptibility to framing, with model families showing distinct tendencies toward agreement or rejection.
- Score: 23.16746081917015
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) are known to produce varying responses depending on prompt phrasing, indicating that subtle guidance in phrasing can steer their answers. However, the impact of this framing bias on LLM-based evaluation, where models are expected to make stable and impartial judgments, remains largely underexplored. Drawing inspiration from the framing effect in psychology, we systematically investigate how deliberate prompt framing skews model judgments across four high-stakes evaluation tasks. We design symmetric prompts using predicate-positive and predicate-negative constructions and demonstrate that such framing induces significant discrepancies in model outputs. Across 14 LLM judges, we observe clear susceptibility to framing, with model families showing distinct tendencies toward agreement or rejection. These findings suggest that framing bias is a structural property of current LLM-based evaluation systems, underscoring the need for framing-aware protocols.
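The abstract describes probing judges with symmetric predicate-positive and predicate-negative framings of the same question. Below is a minimal sketch of how such a probe might be set up; the prompt templates, the `query_judge` stub, and the flip-rate metric are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch of a framing-bias probe for an LLM judge.
# Prompt wording, the judge call, and the metric are assumptions for illustration.

POSITIVE_FRAME = (
    "Question: {question}\nCandidate answer: {answer}\n"
    "Is the candidate answer correct? Reply YES or NO."
)
NEGATIVE_FRAME = (
    "Question: {question}\nCandidate answer: {answer}\n"
    "Is the candidate answer incorrect? Reply YES or NO."
)

def query_judge(prompt: str) -> str:
    """Placeholder for a call to the judge model (e.g. an API request)."""
    raise NotImplementedError

def framing_flip_rate(items):
    """Fraction of items where the two framings yield inconsistent verdicts.

    A framing-robust judge should answer YES to "correct?" exactly when
    it answers NO to "incorrect?" for the same question-answer pair.
    """
    flips = 0
    for question, answer in items:
        pos = query_judge(POSITIVE_FRAME.format(question=question, answer=answer))
        neg = query_judge(NEGATIVE_FRAME.format(question=question, answer=answer))
        correct_under_pos = pos.strip().upper().startswith("YES")
        correct_under_neg = neg.strip().upper().startswith("NO")
        flips += correct_under_pos != correct_under_neg
    return flips / len(items)
```

Under this reading, a flip rate near zero would indicate framing-robust judgments, while a high flip rate corresponds to the framing-induced discrepancies the paper reports.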
Related papers
- Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance. We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state. We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
- Evaluating Scoring Bias in LLM-as-a-Judge [8.67484421243584]
Large Language Models (LLMs) are employed as evaluators for complex tasks. There are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments.
arXiv Detail & Related papers (2025-06-27T15:25:23Z)
- A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [58.32070787537946]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games [3.725822359130832]
Large Language Models (LLMs) are increasingly being explored as evaluators in serious games. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance.
arXiv Detail & Related papers (2025-04-13T10:46:13Z)
- The simulation of judgment in LLMs [32.57692724251287]
Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings (NewsGuard and Media Bias/Fact Check) and against human judgments collected through a controlled experiment.
arXiv Detail & Related papers (2025-02-06T18:52:10Z)
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are at odds with standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework [18.54098084470481]
We analyze sycophancy across vision-language benchmarks and propose an inference-time mitigation framework. Our framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts.
arXiv Detail & Related papers (2024-08-21T01:03:21Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the listed information and is not responsible for any consequences arising from its use.