Fugu-MT Paper Translation (Summary): rStar2-Agent: Agentic Reasoning Technical Report

Paper summary: rStar2-Agent: Agentic Reasoning Technical Report

arxiv url: http://arxiv.org/abs/2508.20722v1
Date: Thu, 28 Aug 2025 12:45:25 GMT
Status: translation complete
System update date: 2025-08-29 18:12:02.390158
Title: rStar2-Agent: Agentic Reasoning Technical Report
Title (reference translation): rStar2-Agent: Agentic Reasoning Technical Report
Authors: Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, Mao Yang
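Abstract summary: rStar2-Agent is a 14B math reasoning model trained with agentic reinforcement learning that achieves frontier-level performance. It boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25.
Reference score (attention score computed independently by this site): 25.266747156205266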
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce rStar2-Agent, a 14B math reasoning model trained with agentic reinforcement learning to achieve frontier-level performance. Beyond current long CoT, the model demonstrates advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem-solving. This capability is enabled through three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and mitigates the high rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the inherent environment noise from coding tools, allowing the model to reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and progresses through multi-RL stages, yielding advanced cognitive abilities with minimal compute cost. To this end, rStar2-Agent boosts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also demonstrates strong generalization to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.
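Abstract (reference translation): rStar2-Agent is a 14B math reasoning model trained with agentic reinforcement learning that achieves frontier-level performance. Going beyond current long CoT, the model exhibits advanced cognitive behaviors, such as thinking carefully before using Python coding tools and reflecting on code execution feedback to autonomously explore, verify, and refine intermediate steps in complex problem solving. This capability is made possible by three key innovations that make agentic RL effective at scale: (i) an efficient RL infrastructure with a reliable Python code environment that supports high-throughput execution and reduces rollout costs, enabling training on limited GPU resources (64 MI300X GPUs); (ii) GRPO-RoC, an agentic RL algorithm with a Resample-on-Correct rollout strategy that addresses the environment noise inherent in coding tools and lets the model reason more effectively in a code environment; (iii) an efficient agent training recipe that starts with non-reasoning SFT and proceeds through multiple RL stages, yielding advanced cognitive abilities at minimal compute cost. As a result, rStar2-Agent lifts a pre-trained 14B model to state of the art in only 510 RL steps within one week, achieving average pass@1 scores of 80.6% on AIME24 and 69.8% on AIME25 and surpassing DeepSeek-R1 (671B) with significantly shorter responses. Beyond mathematics, rStar2-Agent-14B also generalizes strongly to alignment, scientific reasoning, and agentic tool-use tasks. Code and training recipes are available at https://github.com/microsoft/rStar.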
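
Note on the metric: the abstract reports "average pass@1" but does not say how it is computed. A common convention, assumed here, is to sample several responses per problem, score each as correct or incorrect, take the fraction correct per problem, and average those fractions over the benchmark. The sketch below illustrates that convention only; the function name `average_pass_at_1`, the input layout, and the toy outcomes are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean


def average_pass_at_1(results):
    """Average pass@1 over a benchmark.

    results: dict mapping a problem id to a list of booleans, one per
    sampled response (True = the final answer was judged correct).
    This layout is an assumption made for illustration.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    per_problem = []
    for problem_id, outcomes in results.items():
        if not outcomes:
            raise ValueError(f"no samples for problem {problem_id!r}")
        # Fraction of sampled responses that solved this problem.
        per_problem.append(sum(outcomes) / len(outcomes))
    # Unweighted mean across problems.
    return mean(per_problem)


# Made-up outcomes for illustration only (not data from the paper).
toy_results = {
    "p1": [True, True, False, True],    # 3 of 4 correct
    "p2": [False, False, True, False],  # 1 of 4 correct
}
score = average_pass_at_1(toy_results)
print(f"average pass@1 = {score:.1%}")  # prints "average pass@1 = 50.0%"
```

Averaging over several samples per problem reduces the variance of the estimate on small benchmarks such as AIME; how many samples were used for the reported scores is not stated in the abstract.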


Related papers