Fugu-MT Paper Translation (Abstract): Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment

Paper summary: Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment

arxiv url: http://arxiv.org/abs/2508.10530v1
Date: Thu, 14 Aug 2025 11:05:18 GMT
Status: Translation complete
Last updated in system: 2025-08-15 22:24:48.280514
Title: Diversity First, Quality Later: A Two-Stage Assumption for Language Model Alignment
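Title (reference translation): Diversity first, quality later - a two-stage assumption for language model alignment
Authors: Zetian Sun, Dongfang Li, Baotian Hu
Abstract summary: The alignment of language models (LMs) with human preferences is important for building trustworthy AI systems. Recently, Direct Preference Optimization (DPO) has been proposed as an LM alignment method that optimizes the policy directly from static preference data. On-policy data is not always optimal, and a systematic difference in effectiveness arises between static and on-policy preference candidates.
Reference score (independently computed attention score): 16.059172179404467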
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data, and it was further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show that on-policy data is not always optimal, with a systematic effectiveness difference emerging between static and on-policy preference candidates. For example, on-policy data can be 3$\times$ as effective as static data for Llama-3, but only 0.4$\times$ as effective for Zephyr. To explain the phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundaries between them. We perform experiments on 5 models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO, SLiC-HF) to show the generalizability of the alignment stage assumption and the boundary measurement.
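Abstract (reference translation): The alignment of language models (LMs) with human preferences is important for building trustworthy AI systems. The problem is typically to optimize an LM policy so as to maximize the expected reward reflecting human preferences. Recently, Direct Preference Optimization (DPO) has been proposed as an LM alignment method that optimizes the policy directly from static preference data. However, on-policy data is not always optimal, and a systematic difference in effectiveness arises between static and on-policy preference candidates. For example, on-policy data yields 3$\times$ the effectiveness of static data for Llama-3 and 0.4$\times$ for Zephyr. To explain this phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: a preference injection stage, which benefits from diverse data, and a preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm for identifying the boundary between them. We conduct experiments with 5 models (Llama, Zephyr, Phi-2, Qwen, Pythia) and 2 alignment methods (DPO, SLiC-HF) to demonstrate the generalizability of the alignment stage assumption and the boundary measurement.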
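
Note: The abstract refers to DPO without stating its objective. As a minimal sketch, the per-pair DPO loss as it is commonly written (from the original DPO formulation, not from this paper's abstract) is the negative log-sigmoid of a scaled difference of policy-to-reference log-ratios for the preferred and dispreferred responses. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions. Whether the two candidate responses come from a static dataset or are sampled on-policy during training does not change this formula, only where the inputs come from.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (illustrative sketch).

    All names and the default beta are assumptions for illustration;
    they are not taken from the paper summarized above.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled preference margin.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference model, the margin is 0
# and the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Raising the policy's log-probability of the preferred response
# lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -2.0)
```

In this sketch, static preference data would supply fixed (chosen, rejected) pairs, whereas on-policy sampling would generate the candidate responses from the current policy inside the training loop before scoring them.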

List of related papers