Fugu-MT paper translation (abstract): Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

Paper summary: Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

arxiv url: http://arxiv.org/abs/2605.02469v1
Date: Mon, 04 May 2026 11:10:32 GMT
Status: translation complete
System update date: 2026-05-05 20:33:50.257996
Title: Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
Title (reference translation): Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
Authors: Yao Shu, Chenxing Wei, Hongbin Lin, Shuang Qiu, Hui Xiong
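Abstract summary: This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings.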
Reference score (independently computed attention score): 28.166458412533967
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Online reinforcement learning with verifiable rewards (RLVR) turns checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluations on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts seems to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. The optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $\beta\log(1/\pi^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage and exposes the temperature--coverage--variance frontier. When coverage needs adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings, within the stated single-run scope.
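Abstract (reference translation): Online reinforcement learning with verifiable rewards (RLVR) converts checkable outcomes into a scalable training signal, but it keeps rollout generation, verifier scoring, and reference-policy evaluation on the optimization path. Static weighted supervised fine-tuning (SFT) on precomputed rollouts appears to remove this bottleneck, yet a weighted likelihood is not specified by rewards alone: its sampler and weights induce the policy being fit. This paper identifies the reference-sampled weighted-SFT objective whose induced policy equals the fixed-reference KL-regularized RLVR optimizer. That optimizer is the standard Boltzmann target policy, obtained by exponentially tilting the reference policy by the verifier reward. Matching a weighted-SFT induced policy to this target forces density-ratio weights; in the reference-sampled subclass, this reduces uniquely, up to prompt scaling, to the prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$. BOLT, a Boltzmann-Targeted SFT procedure, is the empirical estimator of this projection. The finite one-shot analysis separates the exact stored-support price $\beta\log(1/\pi^*(S_N\mid x))$ from partition estimation, effective-sample-size variance, generalization, optimization, and approximation errors. This decomposition explains why extra SFT epochs cannot repair missing reference-policy coverage, and it exposes the temperature-coverage-variance frontier. When coverage requires adaptive sampling, refreshed Boltzmann projections become KL policy mirror descent; finite inner solves enter as additive drift from the exact mirror step. Single-run Qwen experiments provide projection evidence, within the stated single-run scope, for the target-matched weight, one-shot saturation, refreshed-sampler gains, and optimization-time savings.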
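
Illustrative sketch (not from the paper): the abstract's prompt-normalized Boltzmann weight $\exp(r(x,y)/\beta)/Z(x)$ can be approximated, for a single prompt, from rewards of rollouts drawn from the reference policy. The code below is a minimal sketch under stated assumptions: the function names `boltzmann_weights` and `effective_sample_size` are hypothetical, $Z(x)$ is replaced by a plug-in sample mean over the stored rollouts, and the effective sample size uses the common $(\sum w)^2/\sum w^2$ formula. It is not claimed to be the authors' BOLT implementation.

```python
import math


def boltzmann_weights(rewards, beta):
    """Weights exp(r_i / beta) / Z_hat for the rollouts of one prompt.

    Assumption: Z_hat is the sample mean of exp(r_i / beta) over rollouts
    drawn from the reference policy, a plug-in estimate of Z(x).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not rewards:
        raise ValueError("need at least one rollout")
    # Shift by the max reward for numerical stability; the shift cancels
    # between numerator and Z_hat.
    r_max = max(rewards)
    tilts = [math.exp((r - r_max) / beta) for r in rewards]
    z_hat = sum(tilts) / len(tilts)
    return [t / z_hat for t in tilts]


def effective_sample_size(weights):
    """Standard (sum w)^2 / sum w^2 effective sample size."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Toy example: binary verifier rewards for four stored rollouts.
rewards = [1.0, 0.0, 0.0, 1.0]
weights = boltzmann_weights(rewards, beta=1.0)
ess = effective_sample_size(weights)
```

Under this normalization the weights average to 1 for each prompt. Lowering `beta` concentrates weight on high-reward rollouts and shrinks the effective sample size, which is one way to see the temperature and variance trade-off the abstract mentions.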

Related papers