Fugu-MT paper translation (abstract): Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Paper summary: Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

arxiv url: http://arxiv.org/abs/2605.28295v1
Date: Wed, 27 May 2026 10:46:01 GMT
Status: translation complete
Last updated in system: 2026-05-28 17:38:55.984298
Title: Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR
Authors: Soeun Kim, Albert No
Abstract summary: RLVR (Reinforcement Learning with Verifiable Rewards) trains reasoning models without labeled trajectories. Rollout diversity has become a central bottleneck in RLVR. This paper introduces REFT (Rollout Exploration with First-Token Diversification), a lightweight addition to the RLVR pipeline that samples first tokens uniformly from the policy's top-$N$ candidates.
Reference score (independently computed attention measure): 6.149635000057214
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
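The first-token step described in the abstract can be illustrated with a minimal sketch. It assumes the policy's logits for the first position after the reasoning marker are available as a plain list, picks the top-$N$ token ids, and spreads a rollout group as evenly as possible across them. The function names, the handling of leftover rollouts when the group size is not a multiple of $N$, and the toy logits are hypothetical and not taken from the paper; the rest of each rollout would be generated by the policy as usual.

```python
import random
from typing import List, Optional, Sequence


def top_n_token_ids(first_token_logits: Sequence[float], n_top: int) -> List[int]:
    """Return the ids of the n_top highest-scoring first-token candidates."""
    if not 1 <= n_top <= len(first_token_logits):
        raise ValueError("n_top must be between 1 and the vocabulary size")
    ranked = sorted(
        range(len(first_token_logits)),
        key=lambda i: first_token_logits[i],
        reverse=True,
    )
    return ranked[:n_top]


def assign_first_tokens(
    first_token_logits: Sequence[float],
    n_top: int,
    group_size: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Choose a forced first token for each rollout in a group.

    Candidates are the policy's own top-N first tokens. Each candidate
    receives the same number of rollouts, ignoring its probability.
    """
    if group_size < 1:
        raise ValueError("group_size must be positive")
    if rng is None:
        rng = random.Random()
    candidates = top_n_token_ids(first_token_logits, n_top)
    base, remainder = divmod(group_size, n_top)
    # Every candidate gets `base` rollouts.
    assignment = [tok for tok in candidates for _ in range(base)]
    # Assumption: leftover rollouts go to distinct candidates drawn uniformly.
    assignment += rng.sample(candidates, remainder)
    rng.shuffle(assignment)
    return assignment


if __name__ == "__main__":
    toy_logits = [0.1, 2.0, -1.0, 1.5, 0.3]  # hypothetical 5-token vocabulary
    print(assign_first_tokens(toy_logits, n_top=3, group_size=8, rng=random.Random(0)))
```

Because the allocation ignores the candidates' probabilities, a sharply peaked first-token distribution no longer concentrates the whole group on a single opening token, which is the effect the abstract attributes to REFT.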

Related papers