Fugu-MT Paper Translation (Abstract): SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

Paper abstract: SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

arxiv url: http://arxiv.org/abs/2605.18864v1
Date: Fri, 15 May 2026 07:42:21 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:08.748259
Title: SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
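Title (reference translation): SAGE: Anchors for Guided Exploration in RLVR of LLMs
Authors: Chanuk Lee, Minki Kang, Sung Ju Hwang
Abstract summary: Reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k. A central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution. We propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself.
Reference score (independently computed attention score): 55.46289074417954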
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.
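Abstract (reference translation): Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks but often fails to deliver comparable gains in pass@k. This raises the question of whether RLVR genuinely enables large language models to acquire new reasoning abilities, or merely makes sampling of reasoning modes already present in the base model more efficient. Prior analyses largely support the latter view, attributing the limitation to structural properties of standard RLVR objectives that produce insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution and thereby suppresses the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL is a satisfactory solution, because both disrupt the efficiency-coverage trade-off, either by inducing reward hacking or by allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), and that achieves consistent improvements in both pass@1 and pass@k on challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.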
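
Illustrative note: the anchoring effect the abstract attributes to reverse-KL regularization can be seen in a toy finite setting. The sketch below is not taken from the paper or its repository. It uses the standard closed-form maximizer of E_pi[r] - beta * KL(pi || anchor), which is proportional to anchor * exp(r / beta). The four "reasoning modes", all numbers, the name `kl_regularized_optimum`, and above all the multiplicative way the guide `q` reshapes the anchor are assumptions made for illustration only; the abstract does not specify how SAGE defines q(x,y) or applies it.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def kl_regularized_optimum(anchor, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || anchor) on a finite set.

    The maximizer is proportional to anchor[i] * exp(rewards[i] / beta).
    """
    return normalize([a * math.exp(r / beta) for a, r in zip(anchor, rewards)])


# Toy setup (made-up numbers): four candidate reasoning modes for one prompt.
# Modes 0-2 are verified correct (reward 1), mode 3 is incorrect (reward 0).
p_ref = [0.70, 0.25, 0.04, 0.01]   # reference (base model) probabilities
rewards = [1.0, 1.0, 1.0, 0.0]
beta = 0.5                         # KL coefficient

# Standard reverse-KL anchoring: equally rewarded modes keep the reference's
# relative proportions, so the rare correct mode 2 stays rare.
pi_standard = kl_regularized_optimum(p_ref, rewards, beta)

# ASSUMPTION: a hypothetical guide that multiplicatively up-weights the rare
# mode in the anchor. This is NOT the paper's definition of q(x, y).
q = [1.0, 1.0, 5.0, 1.0]
reshaped_anchor = normalize([p * g for p, g in zip(p_ref, q)])
pi_reshaped = kl_regularized_optimum(reshaped_anchor, rewards, beta)

print("standard optimum:", [round(p, 4) for p in pi_standard])
print("reshaped optimum:", [round(p, 4) for p in pi_reshaped])
```

In this toy case the ratio between modes 2 and 0 under the standard optimum equals their ratio under the reference, which is one concrete reading of "anchoring". Changing the anchor shifts mass toward the rare correct mode while the KL term stays in place.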

Related papers