Fugu-MT paper translation (abstract): Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

Paper abstract: Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.23038v1
Date: Mon, 27 Oct 2025 06:03:37 GMT
Status: Translation complete
System update date: 2025-10-28 15:28:15.467672
Title: Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Title (reference translation): Incentivizing agentic reasoning in LLM judges through tool-augmented reinforcement learning
Authors: Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan, Carl Yang, Hongkun Yu
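Abstract summary: Large language models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. We propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
Reference score (independently computed attention metric): 30.906073889018728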
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation. However, most LLM judges operate solely on intrinsic text-based reasoning, limiting their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves listwise performance comparable to Claude-Opus-4 despite having only 8B parameters. Remarkably, TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.
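Abstract (reference translation): Large language models (LLMs) are widely used to evaluate response quality, offering a scalable alternative to human evaluation. However, most LLM judges rely solely on intrinsic text-based reasoning, which limits their ability to verify complex constraints or perform accurate computation. Motivated by the success of tool-integrated reasoning (TIR) on many tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation. TIR-Judge is built on three principles: (i) diverse training across verifiable and non-verifiable domains, (ii) flexible judgment formats (pointwise, pairwise, listwise), and (iii) iterative RL that bootstraps directly from the initial model without distillation. On seven public benchmarks, TIR-Judge surpasses strong reasoning-based judges by up to 6.4% (pointwise) and 7.7% (pairwise), and achieves performance comparable to Claude-Opus-4 despite having only 8B parameters. TIR-Judge-Zero, trained entirely without distilled judge trajectories, matches the performance of the distilled variants, demonstrating that tool-augmented judges can self-evolve through iterative reinforcement learning.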
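Illustrative sketch (not from the paper): the abstract names three judgment formats and a code executor used for precise verification. The Python below is a minimal, hypothetical illustration of how one programmatic check could back pointwise, pairwise, and listwise verdicts. The function names, the word-count constraint, and the tie-breaking rules are all assumptions made for this example; the snippet does not reproduce TIR-Judge's actual implementation and involves no reinforcement learning.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def word_count_check(target: int) -> Check:
    """Hypothetical verifier: stands in for code a tool-using judge might run."""
    def check(response: str) -> bool:
        # Exact computation instead of estimating length by reading the text.
        return len(response.split()) == target
    return check


def pointwise(response: str, check: Check) -> int:
    """Score one response: 1 if the constraint holds, otherwise 0."""
    return int(check(response))


def pairwise(a: str, b: str, check: Check) -> str:
    """Compare two responses and return 'A', 'B', or 'tie'."""
    score_a, score_b = pointwise(a, check), pointwise(b, check)
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


def listwise(responses: List[str], check: Check) -> int:
    """Return the index of the first highest-scoring response."""
    scores = [pointwise(r, check) for r in responses]
    return scores.index(max(scores))


if __name__ == "__main__":
    five_words = word_count_check(5)
    print(pointwise("one two three four five", five_words))
    print(pairwise("one two three four five", "too short", five_words))
    print(listwise(["a b", "one two three four five", "x"], five_words))
```

All three formats reuse the same executable check, so the verdict rests on a computed fact (the word count) rather than on text-only reasoning. This is the kind of limitation the abstract attributes to judges without tools.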

Related papers