Fugu-MT Paper Translation (Abstract): $f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

Paper overview: $f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

arxiv url: http://arxiv.org/abs/2605.06977v1
Date: Thu, 07 May 2026 21:48:26 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.639179
Title: $f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
Title (reference translation): $f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
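Abstract summary: Reinforcement Learning from Human Feedback is a cornerstone technique for post-training large language models. Recent empirical studies have begun exploring alternative divergences as regularizers in RLHF. This work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective.
Reference score (independently computed attention measure): 19.590316589389577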
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general $f$-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective. Rather than treating each possible divergence function individually, we adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second introduces a new method that exploits the sensitivity of the optimal policy to reward perturbations under $f$-divergence regularization. Theoretical analysis shows that $O(\log T)$ regret and $O(1/T)$ sub-optimality gap are achievable, establishing provable efficiency of both algorithms and, to the best of our knowledge, the first performance bounds for online RLHF under general $f$-divergence regularization.
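Abstract (reference translation): Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. Most existing approaches rely on reverse KL regularization, but recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general $f$-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective. Rather than treating each divergence function individually, the authors adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second introduces a new method that exploits the sensitivity of the optimal policy to reward perturbations under $f$-divergence regularization. Theoretical analysis shows that $O(\log T)$ regret and an $O(1/T)$ sub-optimality gap are achievable, establishing the provable efficiency of both algorithms and, to the best of the authors' knowledge, the first performance bounds for online RLHF under general $f$-divergence regularization.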
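
The regularizers named in the abstract (reverse KL, forward KL, chi-squared) can all be written as an $f$-divergence $D_f(\pi \,\|\, \pi_{\mathrm{ref}}) = \sum_x \pi_{\mathrm{ref}}(x)\, f\big(\pi(x)/\pi_{\mathrm{ref}}(x)\big)$ for a convex $f$ with $f(1) = 0$. The sketch below illustrates this for a small discrete response set. It is not the paper's algorithm: the function names, the coefficient `beta`, the toy rewards, and the choice of generator functions follow the standard textbook convention and are assumptions, since the abstract does not state the paper's exact notation.

```python
import math

def f_divergence(policy, ref, f):
    """D_f(policy || ref) = sum_x ref(x) * f(policy(x) / ref(x)).

    Assumes discrete distributions over the same support with ref(x) > 0.
    """
    return sum(r * f(p / r) for p, r in zip(policy, ref))

# Generator functions (standard convention, assumed here).
def reverse_kl_generator(t):
    # f(t) = t log t gives KL(policy || ref), the usual RLHF regularizer.
    return t * math.log(t) if t > 0 else 0.0

def forward_kl_generator(t):
    # f(t) = -log t gives KL(ref || policy); requires t > 0.
    return -math.log(t)

def chi_squared_generator(t):
    # f(t) = (t - 1)^2 gives the chi-squared divergence.
    return (t - 1.0) ** 2

def regularized_objective(policy, ref, rewards, beta, f):
    """Expected reward minus beta times the f-divergence to the reference policy."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * f_divergence(policy, ref, f)

if __name__ == "__main__":
    ref = [0.25, 0.75]      # hypothetical reference policy over two responses
    policy = [0.5, 0.5]     # hypothetical candidate policy
    rewards = [1.0, 0.0]    # hypothetical rewards
    for name, f in [("reverse KL", reverse_kl_generator),
                    ("forward KL", forward_kl_generator),
                    ("chi-squared", chi_squared_generator)]:
        print(name, regularized_objective(policy, ref, rewards, beta=0.1, f=f))
```

Swapping the generator function changes only the penalty term, which is one way to read the abstract's point about analyzing the whole function class at once rather than each divergence separately.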


Related papers