Fugu-MT Paper Translation (Abstract): Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Paper overview: Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

arxiv url: http://arxiv.org/abs/2605.12380v1
Date: Tue, 12 May 2026 16:44:47 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:57.026304
Title: Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
Title (reference translation): Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training
Authors: Rasool Fakoor, Murdock Aubry, Nicholas Stranges, Alexander J. Smola
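Abstract summary: Reinforcement learning is structurally harder than supervised learning. This paper proposes a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios.
Reference score (independently computed attention metric): 50.86545293331458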
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.
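Abstract (reference translation): Reinforcement learning is structurally harder than supervised learning. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective. The fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. The paper proposes a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that the method matches or exceeds tuned baselines while introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.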
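
The abstract names the normalized effective sample size (ESS) of the policy ratios as the batch statistic but does not state its formula. The following is a minimal sketch, assuming the standard importance-sampling definition (sum of ratios squared, divided by N times the sum of squared ratios). The helper name `normalized_ess` and the sample inputs are hypothetical, and the way the paper converts this value into a weight cap or a regularizer strength is not given in the abstract, so it is not shown.

```python
import math

def normalized_ess(log_ratios):
    """Normalized ESS of importance ratios given as log-ratios.

    Assumed definition (standard, not taken from the paper):
        (sum_i w_i)^2 / (N * sum_i w_i^2),  with w_i = exp(log_ratio_i).
    The value lies in [1/N, 1]: 1 when all ratios are equal,
    1/N when a single ratio dominates the batch.
    """
    if not log_ratios:
        raise ValueError("need at least one log-ratio")
    # The statistic is invariant to rescaling all weights, so subtracting
    # the maximum log-ratio avoids overflow without changing the result.
    shift = max(log_ratios)
    weights = [math.exp(x - shift) for x in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (len(weights) * total_sq)

# Nearly on-policy batch: all ratios equal, so the statistic is at its maximum.
uniform = normalized_ess([0.0] * 8)

# Stale or mismatched batch: one token carries almost all of the weight.
concentrated = normalized_ess([0.0, 0.0, 0.0, 50.0])
```

Under this assumed definition, a value near 1 corresponds to the "nearly uniform" case described in the abstract, and a value near 1/N corresponds to the "ratio concentration" case.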

Related papers