Fugu-MT Paper Translation (Abstract): Alignment Dynamics in LLM Fine-Tuning

Paper overview: Alignment Dynamics in LLM Fine-Tuning

arxiv url: http://arxiv.org/abs/2605.18309v1
Date: Mon, 18 May 2026 12:27:12 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.604891
Title: Alignment Dynamics in LLM Fine-Tuning
Title (reference translation): Alignment Dynamics in LLM Fine-Tuning
Authors: Yuhan Huang, Huanran Chen, Yinpeng Dong,
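Abstract summary: Large language models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback. This work introduces a tractable alignment score, derives its closed-form update during fine-tuning, and provides a unified framework for alignment dynamics.
Reference score (independently computed attention score): 37.49269074190027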
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.
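Abstract (reference translation): Large language models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, but that alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute the fragility of alignment to gradient geometry or characterize it as a distributional shift in model outputs, yet few offer a unified account that bridges parameter-space learning dynamics and function-space alignment behavior during fine-tuning. This work introduces a tractable alignment score and derives its closed-form update during fine-tuning, providing a unified framework for alignment dynamics. The analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of the model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why a narrower posterior structure strengthens that reversal. The framework also predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. These predictions are validated across safety alignment, emergent misalignment, and sentiment settings, which show consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.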
