Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
- URL: http://arxiv.org/abs/2510.23038v1
- Date: Mon, 27 Oct 2025 06:03:37 GMT
- Title: Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
- Authors: Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu
- Abstract summary: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. We propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
- Score: 30.906073889018728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.
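The core idea of the abstract, a judge that executes code to check responses rather than relying on text-only reasoning, can be sketched in a few lines. This is a minimal illustrative stub, not the paper's actual system: in TIR-Judge the model itself writes the verification code during RL training, whereas here a hand-written check stands in for the model's output to show the execute-then-judge loop. All function names are hypothetical.

```python
import contextlib
import io

def run_tool(code: str) -> str:
    """Execute judge-emitted Python and capture stdout, as a code executor would."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        return f"ERROR: {e}"
    return buf.getvalue().strip()

def pairwise_judge(prompt: str, resp_a: str, resp_b: str) -> str:
    """Prefer the response whose computation checks out under execution."""
    # In TIR-Judge the model generates this check itself; here it is fixed
    # to the known target value 56088 = 123 * 456 for illustration.
    check = (
        f"print(abs(eval({resp_a!r}) - 56088) < 1e-9, "
        f"abs(eval({resp_b!r}) - 56088) < 1e-9)"
    )
    ok_a, ok_b = run_tool(check).split()
    if ok_a == "True" and ok_b != "True":
        return "A"
    if ok_b == "True" and ok_a != "True":
        return "B"
    return "tie"

# Which response correctly computes 123 * 456?
print(pairwise_judge("Compute 123 * 456", "123 * 456", "123 + 456"))  # → A
```

The point of the sketch is the division of labor: the judge's verdict rests on an executed check rather than on the model's own arithmetic, which is where text-only judges tend to fail.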
Related papers
- JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation [13.831735556002426]
Small language models (SLMs) have shown promise on various reasoning tasks. Their ability to judge the correctness of answers remains unclear compared to large language models (LLMs).
arXiv Detail & Related papers (2025-11-20T01:14:39Z) - SSR: Socratic Self-Refine for Large Language Model Reasoning [78.62319252287938]
Socratic Self-Refine (SSR) is a novel framework for fine-grained evaluation and precise refinement of Large Language Models (LLMs). Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines.
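The decomposition the SSR blurb describes can be sketched concretely. This is an illustrative toy, not the paper's method: the chain-of-thought is split into (sub-question, sub-answer) pairs, and each pair is re-checked independently. Here the independent check is plain arithmetic evaluation; SSR instead re-queries the LLM and derives step-level confidence from agreement. The helper names `decompose` and `verify` are hypothetical.

```python
def decompose(steps):
    """Split 'question = answer' lines into (sub_question, sub_answer) pairs."""
    pairs = []
    for step in steps:
        q, a = step.split("=")
        pairs.append((q.strip(), a.strip()))
    return pairs

def verify(pairs):
    """Return the index of the first step whose sub-answer fails re-checking."""
    for i, (q, a) in enumerate(pairs):
        if eval(q) != int(a):   # independent re-evaluation of the sub-question
            return i            # first faulty step, the target for refinement
    return -1                   # all steps check out

cot = ["2 + 3 = 5", "5 * 4 = 20", "20 - 7 = 12"]   # last step is wrong
print(verify(decompose(cot)))  # → 2
```

Localizing the first failing pair is what makes the refinement "precise": only the faulty step needs to be regenerated rather than the whole response.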
arXiv Detail & Related papers (2025-11-13T18:47:07Z) - RLPR: Extrapolating RLVR to General Domains without Verifiers [103.14103272635893]
We propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. We find that addressing the high variance of this noisy probability reward is crucial to make it work. RLPR consistently improves reasoning capabilities in both areas for Gemma-, Llama-, and Qwen-based models.
arXiv Detail & Related papers (2025-06-23T02:56:36Z) - Quantitative LLM Judges [60.773734899532336]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z) - J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization [69.23273504123941]
We train judges to be robust to positional biases that arise in more complex evaluation settings. We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, respectively.
arXiv Detail & Related papers (2025-05-19T16:50:35Z) - J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [54.85131761693927]
We introduce J1, a reinforcement learning framework for teaching LLM judges to think before making decisions. Our core contribution lies in converting all judgment tasks for non-verifiable and verifiable prompts into a unified format with verifiable rewards. We then use RL to train thinking-judges at scales of 8B, 32B, and 70B and show that they obtain state-of-the-art performance.
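The unification the J1 blurb describes, casting every judgment task so the verdict can be scored against a known preferred outcome, admits a simple reward sketch. This is a hedged illustration under assumed conventions, not the paper's actual reward: `judge_reward`, the preferred-label encoding, and the 0.1 format bonus are all hypothetical choices.

```python
def judge_reward(verdict: str, preferred: str, fmt_ok: bool = True) -> float:
    """Verifiable judge reward: binary verdict correctness plus a small
    bonus for emitting a well-formed judgment."""
    correct = 1.0 if verdict == preferred else 0.0
    return correct + (0.1 if fmt_ok else 0.0)

# Pairwise example: the ground truth says response "A" is preferred.
print(judge_reward("A", "A"))                  # → 1.1
print(judge_reward("B", "A", fmt_ok=False))    # → 0.0
```

Because the reward depends only on the final verdict matching a known preference label, the same signal applies whether the underlying prompt was verifiable (checked programmatically) or non-verifiable (labeled by preference data), which is what makes a single RL recipe possible.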
arXiv Detail & Related papers (2025-05-15T14:05:15Z) - JudgeLRM: Large Reasoning Models as a Judge [65.14085339820795]
We investigate whether Large Language Model (LLM) judges truly benefit from enhanced reasoning capabilities. We introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards.
arXiv Detail & Related papers (2025-03-31T02:18:51Z) - JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance.
We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods.
In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z) - Self-rationalization improves LLM as a fine-grained judge [21.917301609125417]
We introduce Self-Rationalization, an iterative process of improving the rationales for the judge models.
Self-rationalization works by having the model generate multiple judgments with rationales for the same input.
We show that our model learns to produce higher-quality rationales, with an average win rate of 62% compared to models trained only via SFT on rationales.
arXiv Detail & Related papers (2024-10-07T21:05:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.