Fugu-MT Paper Translation (Abstract): Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF

Paper summary: Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF

arxiv url: http://arxiv.org/abs/2601.12415v1
Date: Sun, 18 Jan 2026 13:57:44 GMT
Status: Translation complete
Last updated in system: 2026-01-21 22:47:22.620435
Title: Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF
Title (reference translation): Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF
Authors: Wang Zixian
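Abstract summary: Recent alignment methods for large language models are often presented as distinct algorithms. The paper shows that many of these approaches implicitly conflate two fundamental and independent design choices. It proposes Orthogonalized Policy Optimization (OPO), a framework that explicitly decouples sampling geometry from optimization geometry.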
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent alignment methods for large language models, including PPO, DPO, and IPO, are often presented as distinct algorithms. In this work, we show that many of these approaches implicitly conflate two fundamental and independent design choices: (i) the sampling geometry, which determines which samples dominate the gradient signal, and (ii) the optimization geometry, which determines how deviations in value are penalized. We formalize this observation by expressing alignment as the minimization of a generalized distance between policy energy and target energy, parameterized by an alpha-divergence-based sampling weight and a Bregman-divergence-based value metric. We demonstrate that the commonly used KL divergence induces an exponential penalty on unbounded value signals, leading to numerical instability and vanishing gradients in high-confidence regimes. To address this issue, we propose Orthogonalized Policy Optimization (OPO), a framework that explicitly decouples sampling geometry from optimization geometry. By combining alpha-weighted importance sampling with a chi-square-induced quadratic regularization in ratio coordinates, OPO yields a simple and well-conditioned objective with linear gradient dynamics. This formulation maintains stable optimization while preserving peak-seeking behavior and avoids gradient saturation even when model confidence is high. Our analysis positions OPO as a unifying perspective on existing alignment methods and provides a principled foundation for robust reasoning-oriented training.
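Abstract (reference translation): Recent alignment methods for large language models, including PPO, DPO, and IPO, are often presented as distinct algorithms. This work shows that many of these approaches implicitly conflate two fundamental and independent design choices: (i) the sampling geometry, which determines which samples dominate the gradient signal, and (ii) the optimization geometry, which determines how deviations in value are penalized. The observation is formalized by expressing alignment as the minimization of a generalized distance between policy energy and target energy, parameterized by a sampling weight based on the alpha-divergence and a value metric based on the Bregman divergence. The commonly used KL divergence is shown to induce an exponential penalty on unbounded value signals, which leads to numerical instability and to vanishing gradients in high-confidence regimes. To address this problem, the authors propose Orthogonalized Policy Optimization (OPO), a framework that explicitly decouples sampling geometry from optimization geometry. By combining alpha-weighted importance sampling with a chi-square-induced quadratic regularization in ratio coordinates, OPO yields a simple, well-conditioned objective with linear gradient dynamics. This formulation keeps optimization stable while preserving peak-seeking behavior, and it avoids gradient saturation even when model confidence is high. The analysis positions OPO as a unifying perspective on existing alignment methods and provides a principled foundation for robust reasoning-oriented training.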
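The contrast the abstract draws between an exponential penalty and a quadratic one can be illustrated with the standard f-divergence generators. The following is only a minimal sketch under stated assumptions: it is not the OPO objective, which this page does not give. It takes the textbook KL generator f(r) = r log r - r + 1 and the Pearson chi-square generator f(r) = (r - 1)^2, where r stands for a policy-to-reference probability ratio. The function names and the reading of "ratio coordinates" as working directly in r are assumptions made for the example.

```python
import math


def kl_generator(r: float) -> float:
    """Textbook f-divergence generator for KL: f(r) = r*log(r) - r + 1."""
    return r * math.log(r) - r + 1.0


def kl_in_log_ratio(s: float) -> float:
    """Same KL generator written in the log-ratio s = log(r), i.e. r = exp(s).

    The result, s*exp(s) - exp(s) + 1, grows exponentially in s.
    """
    return kl_generator(math.exp(s))


def chi2_generator(r: float) -> float:
    """Pearson chi-square generator in ratio coordinates: f(r) = (r - 1)^2."""
    return (r - 1.0) ** 2


def chi2_grad(r: float) -> float:
    """Derivative of the chi-square generator: 2*(r - 1), linear in r."""
    return 2.0 * (r - 1.0)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for s in (0.0, 2.0, 5.0, 10.0):
        print(f"s={s:5.1f}  KL generator in log-ratio coords = {kl_in_log_ratio(s):.3e}")
    for r in (0.5, 1.0, 2.0, 4.0):
        print(f"r={r:4.1f}  chi2 penalty = {chi2_generator(r):.2f}  grad = {chi2_grad(r):+.2f}")
```

In this toy setting the chi-square penalty has a gradient that changes by a constant amount per unit of r, while the KL generator expressed in the log-ratio blows up exponentially. How the paper combines such a penalty with alpha-weighted importance sampling is not specified on this page and is left out of the sketch.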

Related papers