Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges
- URL: http://arxiv.org/abs/2509.03419v2
- Date: Fri, 31 Oct 2025 09:50:31 GMT
- Title: Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges
- Authors: Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang
- Abstract summary: The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
- Score: 72.3356133063925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.
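To make the measurement concrete, here is a minimal sketch (our own illustration, not ComplexEval's actual protocol) of how an auxiliary-information-induced bias can be quantified: the same answer is graded twice by the same judge, with and without the auxiliary context, and the paired score shift is averaged. The `Judge` callable and the sample schema are assumptions for illustration only.

```python
# A minimal sketch, not ComplexEval's actual protocol: estimate an
# auxiliary-information-induced bias as the mean shift in a judge's
# score when the same answer is graded with vs. without extra context.
from statistics import mean
from typing import Callable, Optional

# judge(question, answer, aux) -> score. In practice this would wrap an
# LLM judge whose grading prompt optionally embeds the auxiliary context
# (a rubric, a reference answer, metadata, etc.).
Judge = Callable[[str, str, Optional[str]], float]

def auxiliary_bias(judge: Judge, samples: list[dict]) -> float:
    """Mean paired score delta: judge(with aux) - judge(without aux).

    Each sample is {"question": ..., "answer": ..., "aux": ...}.
    A mean delta far from zero suggests the auxiliary information is
    swaying the judge rather than merely informing it.
    """
    deltas = []
    for s in samples:
        with_aux = judge(s["question"], s["answer"], s["aux"])
        without_aux = judge(s["question"], s["answer"], None)
        deltas.append(with_aux - without_aux)
    return mean(deltas)
```

Because the two judgments differ only in the presence of the auxiliary context, the paired design isolates the context's effect from question difficulty and answer quality.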
Related papers
- Enhancing the QA Model through a Multi-domain Debiasing Framework [1.7802147489386633]
This study evaluates the ELECTRA-small model on the Stanford Question Answering Dataset (SQuAD) v1.1 and the adversarial datasets AddSent and AddOneSent. We develop a multi-domain debiasing framework incorporating knowledge distillation, debiasing techniques, and domain expansion.
arXiv Detail & Related papers (2026-01-01T08:39:07Z)
- Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements [78.87065404966002]
Existing benchmarks predominantly curate questions at the question level. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up.
arXiv Detail & Related papers (2025-12-31T13:55:54Z)
- Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models [6.362188639024662]
We introduce MedIRT, a rigorous evaluation framework grounded in Item Response Theory (IRT). We prospectively gathered fresh responses from 80 diverse Large Language Models (LLMs) on a balanced, 1,100-question USMLE-aligned benchmark. We estimate each LLM's latent ability jointly with question difficulty and discrimination, yielding more stable and nuanced performance rankings than accuracy alone. (A minimal IRT sketch follows this entry.)
arXiv Detail & Related papers (2025-09-29T02:06:13Z)
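As a rough illustration of the joint estimation the MedIRT entry describes, below is a generic two-parameter logistic (2PL) IRT sketch: latent ability, question difficulty, and discrimination are fit together by gradient ascent on the Bernoulli log-likelihood of a binary correctness matrix. This is our own assumption of a plausible setup, not the paper's estimator.

```python
# Minimal 2PL IRT sketch (illustrative, not MedIRT's estimator): jointly
# fit model ability (theta), question difficulty (b), and discrimination
# (a) by gradient ascent on the log-likelihood of a 0/1 correctness
# matrix R with shape (n_models, n_questions).
import numpy as np

def fit_2pl(R: np.ndarray, steps: int = 2000, lr: float = 0.05):
    n_models, n_questions = R.shape
    theta = np.zeros(n_models)          # latent abilities
    b = np.zeros(n_questions)           # question difficulties
    a = np.ones(n_questions)            # question discriminations
    for _ in range(steps):
        z = a * (theta[:, None] - b)    # logits, (n_models, n_questions)
        p = 1.0 / (1.0 + np.exp(-z))    # P(model i answers question j)
        err = R - p                     # d(log-likelihood)/dz
        theta += lr * (err * a).sum(axis=1) / n_questions
        b += lr * (-err * a).sum(axis=0) / n_models
        a += lr * (err * (theta[:, None] - b)).sum(axis=0) / n_models
        theta -= theta.mean()           # fix location for identifiability
    return theta, a, b
```

Because ability and difficulty share a scale, centering theta each step pins down the otherwise unidentifiable location; rankings by the fitted theta weight discriminative questions more than raw accuracy does.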
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Reasoning Models Can be Easily Hacked by Fake Reasoning Bias [59.79548223686273]
We introduce THEATER, a comprehensive benchmark to evaluate Reasoning Theater Bias (RTB). We investigate six bias types, including Simple Cues and Fake Chain-of-Thought. We identify 'shallow reasoning' (plausible but flawed arguments) as the most potent form of RTB.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
- VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains [19.579511315215424]
Large language models rely on reinforcement learning to enhance their reasoning capabilities through feedback. Existing research focuses on building better verifiers, yet a systematic evaluation of different types of verifiers' performance remains lacking. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses.
arXiv Detail & Related papers (2025-07-14T03:45:24Z)
- Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models [11.379764847748378]
Large language models (LLMs) often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. We introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics.
arXiv Detail & Related papers (2025-05-29T17:49:44Z)
- R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- Quantifying Generalization Complexity for Large Language Models [31.721781613271066]
We introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of large language models.
Scylla disentangles generalization from memorization by assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data.
We benchmark 28 LLMs, including open-source models such as the LLaMA and Qwen families and closed-source models like Claude and GPT. (A minimal sketch of the ID/OOD gap follows this entry.)
arXiv Detail & Related papers (2024-10-02T17:25:37Z)
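The ID/OOD comparison the Scylla entry describes reduces, at its simplest, to a paired accuracy gap. The sketch below is our reading of that setup, not the paper's exact metric; the exact-match accuracy is an assumed stand-in for whatever scoring the benchmark uses.

```python
# A minimal sketch of the ID/OOD comparison (our reading, not Scylla's
# exact metric): a large drop from in-distribution to out-of-distribution
# accuracy points to memorization rather than genuine generalization.
def accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of exact matches between predictions and references."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def generalization_gap(id_preds: list[str], id_golds: list[str],
                       ood_preds: list[str], ood_golds: list[str]) -> float:
    """ID accuracy minus OOD accuracy; a larger gap suggests memorization."""
    return accuracy(id_preds, id_golds) - accuracy(ood_preds, ood_golds)
```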
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)