Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness
- URL: http://arxiv.org/abs/2509.13332v1
- Date: Tue, 09 Sep 2025 18:36:02 GMT
- Title: Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness
- Authors: Pratik Jayarao, Himanshu Gupta, Neeraj Varshney, Chaitanya Dwivedi,
- Abstract summary: We present a systematic comparison of "thinking" and "non-thinking" Large Language Models (LLMs)<n>We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks.<n>Our results show that thinking models achieve approximately 10% points higher accuracy with little overhead.
- Score: 12.513874407270142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of "thinking" and "non-thinking" LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Our results show that thinking models achieve approximately 10% points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our work results in several important findings that provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.
Related papers
- CAMEL: Confidence-Gated Reflection for Reward Modeling [26.908515245229747]
We propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first.<n>We train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision.<n> Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy.
arXiv Detail & Related papers (2026-02-24T08:20:08Z) - Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models [0.0]
Large language models (LLMs) achieve strong aver- age performance yet remain unreliable at the instance level.<n>We introduce a Multi-Model Consensus Reasoning Engine that treats the set of LLM outputs as input to a supervised meta-learner.<n>The system maps natural language responses into structured features using semantic embeddings, pairwise similarity and clustering statistics, lexical and structural cues, reasoning-quality scores, confidence estimates, and model-specific priors.
arXiv Detail & Related papers (2026-01-12T06:27:06Z) - The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs [0.0]
Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks.<n>We conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs across 6 multimodal datasets with 3 distinct scoring functions.
arXiv Detail & Related papers (2025-09-16T08:17:39Z) - Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings.<n>Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z) - FairReason: Balancing Reasoning and Social Bias in MLLMs [54.26091556079722]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities.<n>Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z) - Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness.<n> observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset.<n>We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification.<n>We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.<n>We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.<n>Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models [0.6091702876917281]
Large Language Models (LLMs) show remarkable proficiency in natural language tasks.<n>Overconfidence-misalignment between predicted confidence and true correctness poses significant risks in critical decision-making applications.<n>We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering datasets.
arXiv Detail & Related papers (2025-02-16T07:46:09Z) - Language Model Preference Evaluation with Multiple Weak Evaluators [89.90733463933431]
We introduce PGED, a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results.<n>We demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning.
arXiv Detail & Related papers (2024-10-14T01:57:25Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs)
We present Efficient Ensemble of Experts (E$3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.