Fugu-MT Paper Translation (Abstract): Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Paper overview: Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

arxiv url: http://arxiv.org/abs/2606.19744v1
Date: Thu, 18 Jun 2026 03:20:41 GMT
Status: Translation complete
System update date: 2026-06-19 18:23:39.628288
Title: Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings
Title (reference translation): Beyond Uniform Forgetting: A Study of Preference Settings in Sequential Direct Preference Optimization
Authors: Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim
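Abstract summary: The paper studies sequential Direct Preference Optimisation (DPO) across four preference settings. Sequential DPO does not produce a single forgetting pattern. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings.
Reference score (independently computed attention score): 11.551054698858266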
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
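Abstract (reference translation): Aligning language models with human preferences often requires optimising several behavioural objectives. A practical approach is to apply these objectives one after another using a preference optimisation method such as Direct Preference Optimisation (DPO), but it is not clear whether later training uniformly degrades the preferences learned earlier or whether the effect depends on how the objectives relate to each other. This work examines sequential DPO in four preference settings: distributional conflict, multi-attribute interaction, a strong safety signal, and compatible response-quality objectives. With Llama-3.1-8B-Instruct and LoRA adapters, all objectives are evaluated after every stage against a fixed base-model reference. Sequential DPO does not yield a single forgetting pattern: depending on the relationship between objectives, the signal strength, and the training order, preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer. A pair-level analysis with length-normalised policy margins shows that aggregate metrics can hide heterogeneous changes across preference pairs, while a quartile decomposition shows that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective in every setting, which gives little evidence that direct gradient opposition is the main driver. These results suggest that future sequential alignment pipelines should take objective compatibility and signal strength into account instead of assuming that later objectives affect earlier preferences uniformly.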
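The two diagnostics named in the abstract, length-normalised policy margins and near-orthogonality of gradients, can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so this assumes the margin is the difference in mean per-token log-probability between the chosen and rejected responses, and that orthogonality is measured by cosine similarity between flattened gradient (or adapter-update) vectors. All function names and the toy numbers are hypothetical.

```python
import math
from typing import Sequence


def length_normalised_logp(token_logps: Sequence[float]) -> float:
    """Mean per-token log-probability of one response (assumed normalisation)."""
    if not token_logps:
        raise ValueError("response must contain at least one token")
    return sum(token_logps) / len(token_logps)


def policy_margin(chosen_logps: Sequence[float],
                  rejected_logps: Sequence[float]) -> float:
    """Assumed margin: positive when the policy favours the chosen response."""
    return (length_normalised_logp(chosen_logps)
            - length_normalised_logp(rejected_logps))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two flattened vectors; near 0 means near-orthogonal."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy per-token log-probabilities for one preference pair (made-up values).
margin = policy_margin([-0.5, -1.0, -1.5], [-2.0, -2.0])

# Toy stand-ins for a Stage 1 and a Stage 2 gradient direction (made-up values).
similarity = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Under these assumptions, tracking `policy_margin` for each pair before and after a later stage is one way to see pair-level changes that an aggregate accuracy would average away, and a `cosine_similarity` close to zero corresponds to the near-orthogonal case the abstract describes.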

Related papers