Fugu-MT Paper Translation (Abstract): Beyond Importance Sampling: Rejection-Gated Policy Optimization

Paper overview: Beyond Importance Sampling: Rejection-Gated Policy Optimization

arxiv url: http://arxiv.org/abs/2604.14895v1
Date: Thu, 16 Apr 2026 11:39:11 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.872103
Title: Beyond Importance Sampling: Rejection-Gated Policy Optimization
Title (reference translation): Policy Optimization Beyond Importance Sampling
Authors: Ziwu Sun, Zhen Gao, Jiyong Zhang, Jiaheng Li
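Abstract summary: This paper introduces Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta with an acceptance gate. RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. The paper shows that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO.
Reference score (independently computed attention score): 8.321518227956323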
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a new perspective on policy optimization: rather than reweighting all samples by their importance ratios, an optimizer should select which samples are trustworthy enough to drive a policy update. Building on this view, we introduce Rejection-Gated Policy Optimization (RGPO), which replaces the importance sampling ratio r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) in the range [0, 1]. Unlike prior work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. We prove that RGPO guarantees finite, bounded gradient variance even when importance sampling ratios are heavy-tailed (where IS variance diverges). We further show that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PPO in computational cost, requires no second-order optimization, and extends naturally to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF (n = 3 seeds), RGPO uses a dual-ratio gate that anchors learning to both the previous policy and the reference model, achieving a Pareto-dominant outcome: the highest reward among online RL methods (+14.8% vs. PPO-RLHF) and the lowest KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).
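Abstract (reference translation): Instead of reweighting every sample by its importance ratio, an optimizer should select the samples that are trustworthy enough to drive a policy update. From this viewpoint, the paper introduces Rejection-Gated Policy Optimization (RGPO), which replaces r_theta = pi_theta / pi_old with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)) taking values in [0, 1]. Unlike earlier work that applies rejection sampling as a data-level heuristic before training, RGPO elevates rejection to an optimization principle: the gate participates directly in gradient computation and is implicitly updated alongside the policy. RGPO provides a unified framework: the policy gradients of TRPO, PPO, and REINFORCE all correspond to specific choices of the effective gradient weight w(r) = g'(r) * r. The paper proves that RGPO guarantees finite, bounded gradient variance even when the importance sampling ratios are heavy-tailed (the regime in which IS variance diverges). It further shows that RGPO incurs only a bounded, controllable bias and provides an approximate monotonic policy improvement guarantee analogous to TRPO. RGPO matches PPO in computational cost, requires no second-order optimization, and extends naturally to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF (n = 3 seeds), RGPO uses a dual-ratio gate that anchors learning to both the previous policy and the reference model, and achieves a Pareto-dominant outcome: the highest reward among online RL methods (+14.8% vs. PPO-RLHF) and the lowest KL divergence to the reference model (-16.0% vs. PPO-RLHF, -53.1% vs. GRPO).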
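
Illustrative sketch (not from the paper): the effective gradient weight w(r) = g'(r) * r in the abstract follows from the chain rule, because the gradient of r_theta with respect to theta equals r_theta times the gradient of log pi_theta, so the gradient of g(r_theta) is g'(r_theta) * r_theta times the gradient of log pi_theta. The abstract does not specify the form of g, so the Python snippet below assumes a hypothetical gate g(r) = r / (1 + r), chosen only because it is smooth and stays in [0, 1); the function names and the log-normal ratio sampler are likewise assumptions made for illustration. It compares the plain importance weight (g(r) = r, hence w(r) = r) with the gated weight on widely spread ratios.

```python
import math
import random


def gate(r):
    """Hypothetical smooth gate g(r) = r / (1 + r); NOT the paper's gate.

    Maps any ratio r >= 0 into [0, 1).
    """
    return r / (1.0 + r)


def gate_derivative(r):
    """Analytic derivative g'(r) = 1 / (1 + r)^2 of the assumed gate."""
    return 1.0 / (1.0 + r) ** 2


def effective_weight(r):
    """Effective gradient weight w(r) = g'(r) * r, as defined in the abstract."""
    return gate_derivative(r) * r


def is_weight(r):
    """Plain importance sampling: g(r) = r, so g'(r) = 1 and w(r) = r."""
    return r


def sample_ratios(n, sigma=2.0, seed=0):
    """Draw log-normal ratios; an assumed toy distribution with large outliers."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(0.0, sigma)) for _ in range(n)]


ratios = sample_ratios(10_000)
max_is = max(is_weight(r) for r in ratios)
max_gated = max(effective_weight(r) for r in ratios)

# For this assumed gate, w(r) = r / (1 + r)^2 peaks at r = 1 with value 0.25,
# so the gated weight stays bounded however large an individual ratio gets.
print(f"largest IS weight:    {max_is:.3f}")
print(f"largest gated weight: {max_gated:.3f}")
```

With this particular gate the weight shrinks toward zero for very large and very small ratios alike, which is one way a gate can down-weight samples far from the behavior policy; whether the paper's gate behaves the same way is not stated in the abstract.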

Related papers