Fugu-MT Paper Translation (Abstract): Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

Paper overview: Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback

arxiv url: http://arxiv.org/abs/2605.00155v1
Date: Thu, 30 Apr 2026 19:22:56 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.721104
Title: Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback
Authors: Yikai Wang, Shang Liu, Jose Blanchet
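Abstract summary: We propose distributionally robust regret optimization (DRRO) for reinforcement learning from human feedback (RLHF). Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. The results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation.
Reference score (independently computed attention measure): 11.841115170669012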
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$ ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
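As a rough illustration of the contrast the abstract draws between worst-case value (DRO) and worst-case regret (DRRO), the sketch below uses a toy promptwise setup that is an assumption and is not taken from the paper: a policy `p` on the simplex over a few candidate responses, a nominal reward vector `r_hat`, and additive reward perturbations `d` with `||d||_1 <= eps`. The function names and the closed forms are derived here for this toy model only (via the dual norm of the l1 ball); they are not claimed to match the paper's exact result or its water-filling solution.

```python
from typing import Sequence


def nominal_value(p: Sequence[float], r_hat: Sequence[float]) -> float:
    """Expected nominal reward of policy p."""
    return sum(pi * ri for pi, ri in zip(p, r_hat))


def worst_case_value(p: Sequence[float], r_hat: Sequence[float], eps: float) -> float:
    """DRO-style objective in the toy model (assumption):
    min over ||d||_1 <= eps of sum_i p_i * (r_hat_i + d_i).
    The adversary puts all of its budget on the largest p_i."""
    return nominal_value(p, r_hat) - eps * max(p)


def worst_case_regret(p: Sequence[float], r_hat: Sequence[float], eps: float) -> float:
    """DRRO-style objective in the toy model (assumption):
    max over ||d||_1 <= eps of [max_j (r_hat_j + d_j) - sum_i p_i * (r_hat_i + d_i)].
    For a fixed comparator j the inner problem is linear in d, so its value is
    eps * ||e_j - p||_inf = eps * max(1 - p_j, max_{i != j} p_i)."""
    base = nominal_value(p, r_hat)
    best = float("-inf")
    for j, rj in enumerate(r_hat):
        others = max((pi for i, pi in enumerate(p) if i != j), default=0.0)
        gain = eps * max(1.0 - p[j], others)
        best = max(best, rj - base + gain)
    return best


if __name__ == "__main__":
    r_hat = [1.0, 0.9, 0.0]   # hypothetical proxy rewards for three responses
    eps = 0.5                 # hypothetical perturbation budget
    for p in ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0]):
        print(p, worst_case_value(p, r_hat, eps), worst_case_regret(p, r_hat, eps))
```

In this toy instance, spreading mass over the two near-tied responses lowers the worst-case regret compared with the greedy policy, which gives a small-scale feel for why a robust objective discourages committing fully to the proxy-optimal response.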


Related papers