Fugu-MT Paper Translation (Abstract): Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration

Paper overview: Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration

arxiv url: http://arxiv.org/abs/2510.03865v1
Date: Sat, 04 Oct 2025 16:22:19 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.305947
Title: Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
Title (reference translation): Unlocking the Reasoning Capabilities of LLMs through Reinforcement Learning Exploration
Authors: Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu
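Abstract summary: We propose RAPO, an algorithm that promotes broader yet focused exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset. The results show that RAPO consistently improves problem-solving performance.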
Reference score (independently computed attention score): 8.839121572048018
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs), particularly for mathematical problem solving. However, a fundamental limitation remains: as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or even vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm to promote broader yet focused exploration. Our method (i) utilizes the forward KL penalty to replace the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset, without supervised fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO consistently improves problem-solving performance. Notably, RAPO enables models to surpass the base model's performance ceiling and solves previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.
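Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has recently enhanced the reasoning capabilities of large language models (LLMs). However, as the sampling budget increases, the advantage of RLVR-trained models over their pretrained bases often diminishes or vanishes, revealing a strong dependence on the base model's restricted search space. We attribute this phenomenon to the widespread use of the reverse Kullback-Leibler (KL) divergence regularizer, whose mode-seeking behavior keeps the policy trapped inside the base model's support region and hampers wider exploration. To address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an algorithm that promotes broader yet focused exploration. Our method (i) uses a forward KL penalty in place of the reverse KL penalty for out-of-distribution exploration, and (ii) reweights the reference policy to facilitate adaptive in-distribution exploration. We trained Qwen2.5-3B and 7B models with RAPO on the 8K SimpleRL-Zero dataset and evaluated them on AIME2024 and AIME2025. The results show that RAPO consistently improves problem-solving performance. In particular, RAPO enables models to surpass the base model's performance ceiling and solve previously intractable problems, advancing the frontier of RLVR for challenging reasoning tasks.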
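
The contrast the abstract draws between the reverse and forward KL penalties can be illustrated with a toy discrete example. This is a minimal sketch and not the RAPO algorithm: it assumes the common convention that reverse KL is KL(policy || reference) and forward KL is KL(reference || policy), and the helper `kl` and the four-outcome distributions are hypothetical numbers chosen only for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a zero-probability outcome under p contributes nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical reference (base) policy: outcomes 2 and 3 lie outside its support.
ref = [0.7, 0.3, 0.0, 0.0]
# Hypothetical trained policy that moves some mass outside that support.
policy = [0.5, 0.3, 0.1, 0.1]

reverse_kl = kl(policy, ref)  # assumed convention: KL(policy || reference)
forward_kl = kl(ref, policy)  # assumed convention: KL(reference || policy)

print("reverse KL:", reverse_kl)
print("forward KL:", forward_kl)
```

In this toy setting the reverse KL is infinite as soon as the policy assigns any probability outside the reference support, while the forward KL stays finite, which matches the abstract's description of the reverse KL keeping the policy inside the base model's support region.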
