Fugu-MT Paper Translation (Summary): Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

Paper summary: Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

arxiv url: http://arxiv.org/abs/2605.11134v1
Date: Mon, 11 May 2026 18:41:12 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.363036
Title: Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
Title (reference translation): Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
Authors: Christian Moya, Alex Semendinger, Guang Lin, Elliott Thornley
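Abstract summary: The paper shows that standard preference learning comes to rely on spurious features at the population level through two channels. More data from the same training distribution fails to reduce the model's dependence on spurious features. The authors propose tie training, a data augmentation strategy that uses ties to introduce data-driven regularization.
Reference score (independently computed attention score): 7.233235686245656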
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Preference learning methods such as Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal--spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.
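Abstract (reference translation): Preference learning methods such as Direct Preference Optimization (DPO) are known to rely on spurious correlations, which causes sycophancy and length bias in today's language models and could cause severe goal misgeneralization in future systems. This work gives a unified theoretical analysis of the phenomenon, characterizing the mechanisms of spurious learning, its consequences for deployment, and a provable mitigation strategy. Focusing on log-linear policies, the authors show that standard preference-learning objectives come to depend on spurious features at the population level through two channels: mean spurious bias and causal-spurious correlation leakage. They then show that this dependence creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, they propose tie training, a data augmentation strategy that uses ties (equal-utility preference pairs) to introduce data-driven regularization. They demonstrate that the approach selectively reduces spurious learning without degrading causal learning. Finally, they validate the theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.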
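
The idea of tie training described in the abstract can be illustrated with a small toy experiment. The sketch below is not the authors' setup or objective: the data-generating process, the hidden utility component, the use of a soft target of 0.5 for ties, and all names (`fit_log_linear`, `make_toy_data`) are assumptions made purely for illustration. It fits a two-feature log-linear (Bradley-Terry style) preference model with and without added equal-utility pairs and compares the weight placed on the spurious feature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_log_linear(X, targets, lr=0.5, steps=500):
    """Gradient descent on the logistic (Bradley-Terry) loss.

    X holds feature differences between the two responses of each pair;
    targets are the probability that the first response is preferred.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - targets) / len(targets)
        w -= lr * grad
    return w

def make_toy_data(n=4000, seed=0):
    """Hypothetical toy data (an assumption, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    d_causal = rng.normal(size=n)          # observed causal feature gap
    hidden = rng.normal(size=n)            # unobserved part of the utility gap
    # The spurious feature tracks the hidden utility only in training data.
    d_spurious = hidden + 0.3 * rng.normal(size=n)
    utility_gap = d_causal + hidden
    labels = (rng.random(n) < sigmoid(utility_gap)).astype(float)
    X_pref = np.column_stack([d_causal, d_spurious])
    # Ties: equal utility (zero causal gap), but the spurious feature differs.
    X_tie = np.column_stack([np.zeros(n), rng.normal(size=n)])
    return X_pref, labels, X_tie

X_pref, labels, X_tie = make_toy_data()

# Standard preference learning on strict preferences only.
w_plain = fit_log_linear(X_pref, labels)

# Tie training (assumed form): append ties with a soft target of 0.5.
X_aug = np.vstack([X_pref, X_tie])
t_aug = np.concatenate([labels, np.full(len(X_tie), 0.5)])
w_tie = fit_log_linear(X_aug, t_aug)

print("weights [causal, spurious] without ties:", w_plain)
print("weights [causal, spurious] with ties:   ", w_tie)
```

In this toy the tie pairs have a zero causal gap, so their loss term depends only on the spurious weight and is minimized when that weight is zero. This is one simple way ties could act as a data-driven regularizer aimed at the spurious direction; whether it matches the paper's formal construction is not something the abstract settles.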


Related papers