Fugu-MT Paper Translation (Abstract): AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Paper overview: AgentV-RL: Scaling Reward Modeling with Agentic Verifier

arxiv url: http://arxiv.org/abs/2604.16004v1
Date: Fri, 17 Apr 2026 12:27:36 GMT
Status: Translation complete
Last updated in system: 2026-04-20 22:00:19.909682
Title: AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Title (reference translation): AgentV-RL: Scaling Reward Modeling with an Agentic Verifier
Authors: Jiazheng Zhang, Ziche Fu, Zhiheng Xi, Wenqing Jing, Mingxu Chai, Wei He, Guoqiang Zhang, Chenghao Fan, Chenxin An, Wenxiang Chen, Zhicheng Liu, Haojie Pan, Dingwei Zhu, Tao Gui, Qi Zhang, Xuanjing Huang
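Abstract summary: Verifiers have been shown to enhance LLM reasoning via test-time scaling (TTS). This paper proposes Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. Agentic Verifier is shown to yield consistent performance gains under both parallel and sequential TTS.
Reference score (independently computed attention score): 63.55502685076245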
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
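Abstract (reference translation): Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). However, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning leads to false positives for seemingly plausible solutions, and the lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.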
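
To make the parallel TTS setting concrete, here is a minimal sketch of best-of-N selection driven by a two-sided verifier. It is an illustration only, not the paper's implementation: the `Verifier` signature, the function names, the toy agents, and in particular the choice to average the forward and backward scores are all assumptions, since the abstract does not say how the two agents' judgments are combined.

```python
from typing import Callable, Sequence

# Hypothetical verifier signature: (problem, candidate solution) -> score in [0, 1].
Verifier = Callable[[str, str], float]


def combined_score(problem: str, solution: str,
                   forward: Verifier, backward: Verifier) -> float:
    """Average a forward check and a backward check.

    Averaging is an assumption made for this sketch; the abstract does not
    specify how the two agents' judgments are merged.
    """
    return 0.5 * (forward(problem, solution) + backward(problem, solution))


def best_of_n(problem: str, candidates: Sequence[str],
              forward: Verifier, backward: Verifier) -> str:
    """Parallel TTS as best-of-N: return the highest-scoring candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates,
               key=lambda s: combined_score(problem, s, forward, backward))


# Toy stand-ins for the two agents (purely illustrative string checks).
def toy_forward(problem: str, solution: str) -> float:
    # Stands in for tracing the solution from premises to its conclusion.
    return 1.0 if solution.strip().endswith("= 4") else 0.0


def toy_backward(problem: str, solution: str) -> float:
    # Stands in for re-checking the conclusion against the stated premises.
    return 1.0 if "2 + 2" in solution else 0.0
```

With these toy agents, a candidate that passes only one of the two checks scores 0.5, so a solution that is right in its conclusion but wrong in its premises (or the reverse) loses to one that passes both. In a real system each agent would be a multi-turn, tool-using model call rather than a string check.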