Fugu-MT Paper Translation (Abstract): DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

Paper abstract: DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

arxiv url: http://arxiv.org/abs/2605.08441v1
Date: Fri, 08 May 2026 20:03:19 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.647048
Title: DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
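Title (reference translation): DUET: Optimizing Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
Authors: Haoyu Hu, Xuandong Zhao, Xuhai "Orson" Xu, Nori Jacoby
Abstract summary: Reinforcement learning with verifiable rewards generates hundreds of thousands of tokens per training step. The paper shows that jointly tuning two decisions under a shared compute budget (which prompts receive rollouts, and how long each rollout runs) improves both reasoning quality and wall-clock training time.
Reference score (attention score computed independently by this site): 37.28110997883518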
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) generates hundreds of thousands of tokens per training step, with rollout generation dominating the computational cost. The overall token budget can be controlled along two main dimensions: (i) deciding which prompts to allocate rollouts to, and (ii) deciding how long each rollout should be. Prior work has generally controlled only one of these dimensions at a time. We show that jointly tuning both decisions under a shared compute budget improves both reasoning quality and wall-clock training time. We instantiate this view as DUal-controlled tokEn allocaTion (DUET), a computationally efficient layer over GRPO that uses a lightweight pre-rollout surrogate of prompt informativeness to set how many rollouts each prompt receives, and a marker-gated abort rule with importance reweighting to set when to stop them. On Qwen3-1.7B trained on MATH, DUET outperforms full-budget GRPO and the other three budget-aware baseline methods. DUET's advantage further generalizes to other benchmarks across math and coding, and is on par with the best baseline on the scientific Q&A domain, while also achieving a $1.62\times$ wall-clock speedup. More notably, using only 50% of the token budget, DUET still outperforms all baseline methods at their full budget, achieving an even higher $2.51\times$ speedup over full-budget GRPO. We verify the high performance of DUET on other backbone LLMs, including Qwen3-4B and Llama-3.2-3B-Instruct. Notably, the gap between DUET and the strongest baseline widens as the budget tightens, contrary to the usual pattern in which efficient methods trade off quality as compute decreases. More broadly, these results suggest that DUET budget-aware control strategies are valuable not only for accelerating training, but also for improving the quality of the learning signal.
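Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) generates hundreds of thousands of tokens per training step, and rollout generation dominates the computational cost. The overall token budget can be controlled along two main dimensions: (i) deciding which prompts rollouts are allocated to, and (ii) deciding how long each rollout runs. Earlier work has generally controlled only one of these dimensions at a time. The authors show that jointly tuning both decisions under a shared compute budget improves both reasoning quality and wall-clock training time. They instantiate this view as DUal-controlled tokEn allocaTion (DUET), a computationally efficient layer on top of GRPO that uses a lightweight pre-rollout surrogate of prompt informativeness to set the number of rollouts each prompt receives, and a marker-gated abort rule with importance reweighting to set when rollouts are stopped. With Qwen3-1.7B trained on MATH, DUET outperforms full-budget GRPO and the three other budget-aware baseline methods. DUET's advantage further generalizes to other math and coding benchmarks and is on par with the best baseline in the scientific Q&A domain, while also achieving a $1.62\times$ wall-clock speedup. More notably, using only 50% of the token budget, DUET still outperforms all baseline methods at their full budget and achieves an even higher $2.51\times$ speedup over full-budget GRPO. The authors verify DUET's performance on other backbone LLMs, including Qwen3-4B and Llama-3.2-3B-Instruct. In particular, the gap between DUET and the strongest baseline widens as the budget tightens, in contrast to the usual pattern in which efficient methods trade off quality as compute decreases. These results suggest that DUET's budget-aware control strategies are useful not only for accelerating training but also for improving the quality of the learning signal.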
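
Illustrative sketch (not from the paper): the first control dimension, deciding how many rollouts each prompt receives under a fixed budget, can be pictured with a few lines of Python. The abstract does not specify DUET's surrogate or its allocation rule, so everything here is an assumption made for illustration: the score (the Bernoulli variance p(1-p) of an estimated pass rate), the proportional split with largest-remainder rounding, the per-prompt minimum, and the names `bernoulli_variance` and `allocate_rollouts`.

```python
from typing import List


def bernoulli_variance(pass_rate: float) -> float:
    """Hypothetical informativeness score: largest when the pass rate is near 0.5."""
    return pass_rate * (1.0 - pass_rate)


def allocate_rollouts(scores: List[float], total_rollouts: int,
                      min_per_prompt: int = 0) -> List[int]:
    """Split `total_rollouts` across prompts in proportion to `scores`.

    Assumed rule, not DUET's: proportional shares with largest-remainder
    rounding, so the integer counts sum exactly to the budget.
    """
    n = len(scores)
    if n == 0:
        return []
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    reserved = min_per_prompt * n
    if reserved > total_rollouts:
        raise ValueError("budget too small for the per-prompt minimum")

    spare = total_rollouts - reserved
    total_score = sum(scores)
    if total_score == 0:
        # No signal at all: fall back to a uniform split.
        shares = [spare / n] * n
    else:
        shares = [spare * s / total_score for s in scores]

    counts = [int(share) for share in shares]  # floor, since shares >= 0
    leftover = max(spare - sum(counts), 0)
    # Give the remaining rollouts to the largest fractional parts.
    by_fraction = sorted(range(n), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return [c + min_per_prompt for c in counts]


if __name__ == "__main__":
    pass_rates = [0.0, 0.5, 0.9, 1.0]  # made-up estimates for four prompts
    scores = [bernoulli_variance(p) for p in pass_rates]
    print(allocate_rollouts(scores, total_rollouts=16, min_per_prompt=1))
```

Under these assumptions, prompts that are always solved or never solved receive only the minimum, and most of the budget goes to the prompt whose outcome is most uncertain. The second dimension (stopping rollouts early with a marker-gated abort rule and importance reweighting) is not sketched, because the abstract gives too little detail to illustrate it without guessing.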

Related papers