Fugu-MT Paper Translation (Abstract): Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

Paper summary: Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

arxiv url: http://arxiv.org/abs/2604.19024v1
Date: Tue, 21 Apr 2026 03:20:07 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.592895
Title: Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Title (reference translation): Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Authors: Qiang Liu, Adrienne Kline, Ermin Wei
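Abstract summary: We formulate safe RLHF as an infinite horizon discounted Constrained Markov Decision Process (CMDP). We propose two Safe RLHF algorithms that do not require reward model fitting. To the best of our knowledge, this is the first work to study infinite horizon discounted CMDP under human feedback and establish global, non-asymptotic convergence.
Reference score (independently computed attention score): 11.48153290349358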
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate safe RLHF as an infinite horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting and, in contrast to prior work assuming fixed-length trajectories, support flexible trajectory lengths for training. Both algorithms are based on the primal-dual method and achieve global convergence guarantees with polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite horizon discounted CMDP under human feedback and establish global, non-asymptotic convergence.
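Abstract (reference translation): Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate safe RLHF as an infinite horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting. Both algorithms are based on the primal-dual method and guarantee global convergence at polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite horizon discounted CMDP under human feedback and establish global, non-asymptotic convergence.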
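
Note (illustrative, not from the paper): the abstract names an infinite horizon discounted CMDP and a primal-dual method without giving formulas. The block below is a minimal sketch of the generic textbook form of both, to show what those terms usually mean. All symbols are assumptions introduced here and may differ from the paper's notation: $\rho$ is an initial state distribution, $\gamma$ the discount factor, $r$ a helpfulness reward, $g$ a harmlessness utility, $b$ a safety threshold, $\theta$ the policy parameters, and $\eta_\theta, \eta_\lambda$ step sizes. The sketch does not show how the value terms would be estimated from human preference queries without a reward model, which is the paper's actual contribution.

```latex
% Discounted value of a per-step signal f (f = r or f = g) under policy \pi
V_f^{\pi}(\rho) = \mathbb{E}_{s_0 \sim \rho,\ \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} f(s_t, a_t) \right],
\qquad \gamma \in [0, 1)

% Constrained problem: maximize helpfulness subject to a harmlessness floor
\max_{\pi} \; V_r^{\pi}(\rho) \quad \text{subject to} \quad V_g^{\pi}(\rho) \ge b

% Lagrangian with dual variable \lambda \ge 0
L(\pi, \lambda) = V_r^{\pi}(\rho) + \lambda \left( V_g^{\pi}(\rho) - b \right)

% Generic primal-dual iteration: gradient ascent in \theta, projected descent in \lambda
\theta_{k+1} = \theta_k + \eta_\theta \, \nabla_\theta L(\pi_{\theta_k}, \lambda_k),
\qquad
\lambda_{k+1} = \max\!\left\{ 0,\; \lambda_k - \eta_\lambda \left( V_g^{\pi_{\theta_k}}(\rho) - b \right) \right\}
```

In this form the dual variable grows whenever the current policy violates the constraint ($V_g^{\pi_{\theta_k}}(\rho) < b$), which shifts weight in the primal step toward the harmlessness term.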

Related papers