Fugu-MT Paper Translation (Summary): TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Paper Overview: TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

arxiv url: http://arxiv.org/abs/2506.23423v1
Date: Sun, 29 Jun 2025 23:08:36 GMT
Status: Translation complete
Last updated in system: 2025-07-01 21:27:53.867875
Title: TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs
Title (reference translation): TuCo: Measuring the contribution of fine-tuning to individual responses of LLMs
Authors: Felipe Nuti, Tim Franzmeyer, João Henriques
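Abstract summary: We propose a method for measuring the contribution that fine-tuning makes to individual responses. The method tracks the model's intermediate hidden states, providing more detailed insight into the effects of fine-tuning.
Reference score (attention score computed by this site): 4.3467927523193035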
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. Here, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method tracks the model's intermediate hidden states, providing a more fine-grained insight into the effects of fine-tuning than a simple comparison of final outputs from pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) as the ratio of the magnitudes of the fine-tuning component to the pre-training component. We observe that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces TuCo, and that TuCo is consistently lower on prompts where these attacks succeed compared to those where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of such attacks. In summary, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
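Abstract (reference translation): Past research has studied the effects of fine-tuning on large language models (LLMs). However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. In this paper, we propose a method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. The proposed method provides more detailed insight into the effects of fine-tuning than a simple comparison of the final outputs of the pre-trained and fine-tuned models. We introduce an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component, and analyze it theoretically. Empirically, model behavior and performance can be steered by scaling the fine-tuning component up or down during the forward pass. Motivated by this finding and the theoretical analysis, we define the Tuning Contribution (TuCo) as the ratio of the magnitude of the fine-tuning component to that of the pre-training component. Three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces TuCo, and TuCo is consistently lower on prompts where these attacks succeed than on those where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of such attacks. In summary, TuCo enables quantitative study of how fine-tuning influences model behavior and safety.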
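The abstract describes an exact decomposition of a fine-tuned model into a pre-training component and a fine-tuning component, a steering knob that scales the fine-tuning component during the forward pass, and TuCo as a ratio of their magnitudes. The Python sketch below illustrates that idea on a toy residual network only; it is not the authors' implementation. Assumptions (not from the source): each layer adds an update to a residual stream; at every layer the pre-training part is what the pre-trained layer would add at the fine-tuned model's current hidden state, and the fine-tuning part is the remainder; TuCo is read as ||summed fine-tuning part|| / ||summed pre-training part||, though the paper may normalize or aggregate differently. The names `decomposed_forward`, `tuning_contribution`, `alpha`, `PT_LAYERS`, and `FT_LAYERS` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Hypothetical: one layer's additive update to the residual stream.
Layer = Callable[[Vector], Vector]


def _add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def _sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def _scale(c: float, a: Vector) -> Vector:
    return [c * x for x in a]


def _norm(a: Vector) -> float:
    return math.sqrt(sum(x * x for x in a))


def plain_forward(h0: Vector, layers: Sequence[Layer]) -> Vector:
    """Ordinary residual forward pass: h <- h + f(h) for each layer."""
    h = list(h0)
    for f in layers:
        h = _add(h, f(h))
    return h


def decomposed_forward(
    h0: Vector,
    pt_layers: Sequence[Layer],
    ft_layers: Sequence[Layer],
    alpha: float = 1.0,
) -> Tuple[Vector, Vector, Vector]:
    """Forward pass that splits each fine-tuned update into two parts.

    The pre-training part is what the pre-trained layer would add at the
    current hidden state; the fine-tuning part is the remainder, scaled by
    alpha (1.0 reproduces the fine-tuned model exactly, 0.0 drops it).
    Returns (final state, summed pre-training part, summed fine-tuning part).
    """
    h = list(h0)
    ptc = [0.0] * len(h0)
    ftc = [0.0] * len(h0)
    for f_pt, f_ft in zip(pt_layers, ft_layers):
        pt_part = f_pt(h)
        ft_part = _scale(alpha, _sub(f_ft(h), pt_part))
        ptc = _add(ptc, pt_part)
        ftc = _add(ftc, ft_part)
        h = _add(h, _add(pt_part, ft_part))
    return h, ptc, ftc


def tuning_contribution(ptc: Vector, ftc: Vector) -> float:
    """Assumed reading of TuCo: ||fine-tuning part|| / ||pre-training part||."""
    pt_norm = _norm(ptc)
    if pt_norm == 0.0:
        raise ValueError("pre-training component has zero magnitude")
    return _norm(ftc) / pt_norm


# Hypothetical toy layers standing in for a pre-trained and a fine-tuned model.
PT_LAYERS: List[Layer] = [
    lambda h: [0.5 * x for x in h],
    lambda h: [1.0 for _ in h],
]
FT_LAYERS: List[Layer] = [
    lambda h: [0.5 * x + 0.1 for x in h],
    lambda h: [1.0 - 0.2 * x for x in h],
]

if __name__ == "__main__":
    h0 = [1.0, 2.0]
    h, ptc, ftc = decomposed_forward(h0, PT_LAYERS, FT_LAYERS)
    print("decomposed:", h, "fine-tuned:", plain_forward(h0, FT_LAYERS))
    print("TuCo-style ratio:", tuning_contribution(ptc, ftc))
```

With `alpha=1.0` the decomposed pass matches the plain fine-tuned pass (up to floating-point rounding), which is what makes the split exact; setting `alpha` above or below 1.0 mimics up- or down-scaling the fine-tuning component. In a real LLM each update would come from a transformer block, which this toy does not model.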

Related Papers