Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
- URL: http://arxiv.org/abs/2601.03630v1
- Date: Wed, 07 Jan 2026 06:19:26 GMT
- Title: Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
- Authors: Hui Huang, Xuanxin Wu, Muyun Yang, Yuki Arase
- Abstract summary: This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) however, LRMs still exhibit strong biases related to superficial quality.
- Score: 20.096872828837018
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior instruction-following capabilities in evaluation contexts; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) however, LRMs still exhibit strong biases related to superficial quality. To improve robustness against these biases, we propose PlanJudge, an evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the evaluation. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in both LRMs and standard LLMs.
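The abstract describes PlanJudge only at a high level: make the judge write an explicit evaluation plan first, then execute it. The following is a minimal sketch of what such a plan-then-execute judging prompt could look like, assuming an OpenAI-compatible chat client; the prompt wording, model name, and `plan_judge` helper are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of a plan-then-execute judging prompt in the spirit of PlanJudge.
# Prompt wording, model name, and helper are assumptions, not the paper's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PLAN_JUDGE_PROMPT = """You are evaluating two responses to the same instruction.

Step 1 (Plan): Before examining the responses closely, write an explicit
evaluation plan: list the criteria that matter for this instruction and how
you will weigh them. Do not judge yet.

Step 2 (Execute): Apply your plan to Response A and Response B, criterion by
criterion, citing evidence from each response.

Step 3 (Verdict): Output exactly one final line: "Verdict: A" or "Verdict: B".

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}
"""

def plan_judge(instruction: str, response_a: str, response_b: str,
               model: str = "gpt-4o") -> str:
    """Return 'A' or 'B' using a plan-before-execution judging prompt."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PLAN_JUDGE_PROMPT.format(
            instruction=instruction,
            response_a=response_a,
            response_b=response_b)}],
        temperature=0.0,
    )
    text = completion.choices[0].message.content
    # Parse the final verdict line; fall back to 'A' if the format is ignored.
    for line in reversed(text.strip().splitlines()):
        if line.strip().startswith("Verdict:"):
            return line.split("Verdict:", 1)[1].strip()[:1].upper()
    return "A"
```

The point of the plan step is to commit the judge to content-oriented criteria before it sees arguments that might trigger superficial-quality biases such as length or formatting preferences.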
Related papers
- Are We on the Right Way to Assessing LLM-as-a-Judge? [16.32248269615178]
We introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency and global logical consistency. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings.
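The abstract names the two lenses but not how they are computed. As a rough illustration only (not the Sage implementation), local self-consistency can be read as agreement of a judge with itself across repeated queries, and global logical consistency as transitivity of its pairwise preferences. A minimal sketch, assuming a caller-supplied `judge(prompt, a, b)` callable that returns 'A' or 'B':

```python
from itertools import combinations, permutations
from collections import Counter

def local_self_consistency(judge, prompt, a, b, n_trials=5):
    """Fraction of repeated judgments agreeing with the majority verdict.

    `judge(prompt, a, b)` is a hypothetical callable returning 'A' or 'B'.
    """
    verdicts = [judge(prompt, a, b) for _ in range(n_trials)]
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / n_trials

def global_logical_consistency(judge, prompt, responses):
    """Fraction of response triples whose pairwise preferences are transitive."""
    prefers = {}  # (winner_index, loser_index) -> True
    for i, j in permutations(range(len(responses)), 2):
        if (i, j) not in prefers and (j, i) not in prefers:
            verdict = judge(prompt, responses[i], responses[j])
            winner, loser = (i, j) if verdict == "A" else (j, i)
            prefers[(winner, loser)] = True

    def beats(x, y):
        return (x, y) in prefers

    triples = list(combinations(range(len(responses)), 3))
    consistent = 0
    for x, y, z in triples:
        # With exactly one winner per pair, a triple is either transitive
        # or forms a 3-cycle in one of two orientations.
        cycle = (beats(x, y) and beats(y, z) and beats(z, x)) or \
                (beats(y, x) and beats(z, y) and beats(x, z))
        consistent += 0 if cycle else 1
    return consistent / len(triples) if triples else 1.0
```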
arXiv Detail & Related papers (2025-12-17T23:49:55Z)
- Rectify Evaluation Preference: Improving LLMs' Critique on Math Reasoning via Perplexity-aware Reinforcement Learning [34.43632129774481]
In this paper, we quantify and investigate a potential underlying reason -- imbalanced evaluation preference. Motivated by this analysis, we propose a novel perplexity-aware reinforcement learning algorithm to rectify the evaluation preference.
arXiv Detail & Related papers (2025-11-13T13:37:45Z)
- From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization [62.07990937720985]
Dimension-level Reward Model (DRM) is a new supervision framework for Large Language Models. DRM evaluates the quality of a reasoning process along three fundamental, complementary, and interpretable dimensions. Experimental results show that DRM provides effective supervision signals, guides the optimization of LLMs, and enhances their reasoning ability.
arXiv Detail & Related papers (2025-10-13T14:29:15Z)
- Towards Evaluating Fake Reasoning Bias in Language Models [47.482898076525494]
We show that models favor the surface structure of reasoning even when the logic is flawed. We introduce THEATER, a benchmark that systematically investigates Fake Reasoning Bias (FRB). We evaluate 17 advanced Large Language Models (LLMs) on both subjective DPO and factual datasets.
arXiv Detail & Related papers (2025-07-18T09:06:10Z)
- J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization [69.23273504123941]
We train judges to be robust to positional biases that arise in more complex evaluation settings. We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, respectively.
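The positional bias this entry targets is a judge's tendency to prefer whichever response appears first (or second) regardless of content. A common way to measure it, shown here as a generic sketch rather than the J4R training procedure, is to query the judge twice with the candidate order swapped and count verdict flips; a hypothetical `judge(prompt, a, b)` callable returning 'A' or 'B' is assumed.

```python
def positional_flip_rate(judge, examples):
    """Fraction of examples where swapping candidate order flips the verdict.

    `judge(prompt, a, b)` is a hypothetical callable returning 'A' or 'B';
    `examples` is an iterable of (prompt, response_1, response_2) tuples.
    A position-robust judge should have a flip rate close to 0.
    """
    flips = 0
    total = 0
    for prompt, r1, r2 in examples:
        original = judge(prompt, r1, r2)   # r1 shown as A, r2 as B
        swapped = judge(prompt, r2, r1)    # presentation order reversed
        # Map both verdicts back to the underlying response before comparing.
        winner_original = r1 if original == "A" else r2
        winner_swapped = r2 if swapped == "A" else r1
        flips += int(winner_original != winner_swapped)
        total += 1
    return flips / total if total else 0.0
```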
arXiv Detail & Related papers (2025-05-19T16:50:35Z)
- Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.86300466350013]
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities. We present a benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets.
arXiv Detail & Related papers (2025-04-14T07:14:27Z)
- JudgeLRM: Large Reasoning Models as a Judge [80.07261839142548]
We introduce JudgeLRM, a family of judgment-oriented Large Language Models (LLMs). We find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. We show that JudgeLRM models consistently outperform SFT-tuned baselines of the same size, as well as other RL and SFT variants, and even surpass state-of-the-art reasoning models.
arXiv Detail & Related papers (2025-03-31T02:18:51Z)
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.