Fugu-MT Paper Translation (Summary): S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Paper summary: S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

arxiv url: http://arxiv.org/abs/2606.01561v1
Date: Mon, 01 Jun 2026 02:06:58 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.86644
Title: S-SPPO: Semantic-Calibrated Self-Play Preference Optimization
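Title (reference translation): S-SPPO: Semantic-Calibrated Self-Play Preference Optimization
Authors: Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu, Zhipeng Wang, Huayu Li, ZhengXiao He, Xuanzhao Dong, Prayag Tiwari, Mingkun Xu, Yujian Xiong, Feng Luo, Abolfazl Razi, Brendan Hogan Rappazzo, Anderson Schneider, Yuriy Nevmyvaka
Abstract summary: This paper proposes S-SPPO, a semantically calibrated variant of Self-Play Preference Optimization (SPPO), a method that iteratively refines a policy by training on self-generated win-lose pairs. The study identifies a critical instability in SPPO: when the preference oracle assigns overly confident wins to semantically indistinguishable responses, optimization is prone to policy degeneration. On AlpacaEval 2.0 with Llama-3-8B, S-SPPO achieves a 52.19% win rate and a 47.46% length-controlled win rate.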
Reference score (independently computed attention score): 36.01916066772865
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.
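Abstract (reference translation): Aligning large language models (LLMs) with human preferences is often formulated through Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in its ability to model the common departures from transitivity found in human preferences. To address this, recent work introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. This study, however, reveals a critical instability in SPPO: when the preference oracle assigns overly confident wins to semantically indistinguishable responses, the optimization is prone to policy degeneration. To mitigate this, the authors propose S-SPPO, a dual-space semantic calibration framework with two components: i) Supervision Calibration via semantic gating, which anneals win-rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion, which enforces geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, the calibration is shown to preserve the constant-sum game structure, facilitating convergence to a Nash equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving a 52.19% win rate and a 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.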
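
The two calibration ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the linear interpolation used for the gate, the cosine-similarity overlap measure, the hinge form of the repulsion term, and the names `calibrated_win_target`, `latent_repulsion_penalty`, and `margin` are all assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def calibrated_win_target(raw_win_prob, overlap):
    """Assumed form of semantic gating.

    Linearly interpolates the oracle's win probability toward 0.5
    (the maximum-entropy target for a binary outcome) as the semantic
    overlap between the two responses approaches 1.
    """
    gate = min(max(overlap, 0.0), 1.0)  # clamp overlap into [0, 1]
    return (1.0 - gate) * raw_win_prob + gate * 0.5


def latent_repulsion_penalty(chosen_vec, rejected_vec, margin=0.5):
    """Assumed form of latent repulsion.

    Hinge penalty that is positive only when the chosen and rejected
    representations are more similar than the (hypothetical) margin.
    """
    return max(0.0, cosine_similarity(chosen_vec, rejected_vec) - margin)


if __name__ == "__main__":
    # A confident win (0.9) is left alone for dissimilar responses
    # and pulled to 0.5 for responses that fully overlap.
    for overlap in (0.0, 0.5, 1.0):
        print(overlap, calibrated_win_target(0.9, overlap))
```

Under this assumed interpolation, the targets for the two orderings of a pair still sum to 1 for any overlap value, which is one simple way a calibration can keep the constant-sum structure the abstract refers to.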

Related papers