Fugu-MT Paper Translation (Abstract): General Preference Reinforcement Learning

Paper abstract: General Preference Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.18721v2
Date: Tue, 19 May 2026 20:24:33 GMT
Status: translation complete
System update date: 2026-05-21 14:55:44.324056
Title: General Preference Reinforcement Learning
Authors: Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi
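Abstract summary: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning with verifiable rewards drives emergent reasoning on math and code. General Preference Reinforcement Learning (GPRL) computes per-dimension group-relative advantages and normalizes each on its own scale so that no axis can dominate.
Reference score (independently computed attention score): 25.092902686964788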
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
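
The abstract describes the advantage computation only in words, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes each of G sampled responses to one prompt already has a score on each of the k preference dimensions, that "normalizes each on its own scale" means per-dimension standardization within the group, and that the context-dependent eigenvalues are non-negative and can be used as normalized mixing weights. The function name `gprl_style_advantages` and its parameters are hypothetical.

```python
import numpy as np


def gprl_style_advantages(scores, eigenvalues, eps=1e-8):
    """Aggregate per-dimension group-relative advantages (illustrative only).

    scores:      array of shape (G, k), one row per sampled response,
                 one column per preference dimension (assumed input).
    eigenvalues: array of shape (k,), non-negative weights for this prompt
                 (assumed to stand in for the context-dependent eigenvalues).
    Returns an array of shape (G,) with one scalar advantage per response.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(eigenvalues, dtype=float)

    # Group-relative: center every dimension on the mean over the group.
    centered = scores - scores.mean(axis=0, keepdims=True)

    # Per-dimension scale: divide by that dimension's own standard deviation,
    # so a dimension with large raw values cannot dominate the sum.
    per_dim_adv = centered / (scores.std(axis=0, keepdims=True) + eps)

    # Assumption: weights are normalized to sum to one before mixing.
    weights = weights / (weights.sum() + eps)
    return per_dim_adv @ weights


# Three responses, two dimensions on very different raw scales.
scores = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 200.0]])
adv = gprl_style_advantages(scores, eigenvalues=[1.0, 1.0])

# Rescaling one dimension leaves the result essentially unchanged.
adv_rescaled = gprl_style_advantages(scores * np.array([1.0, 1000.0]),
                                     eigenvalues=[1.0, 1.0])
```

In this sketch the second and third responses receive the same advantage, because each ranks first on one dimension and second on the other once both dimensions are on a common scale. A single scalar score dominated by the second column would instead rank the second response clearly above the third.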

Related papers