Fugu-MT 論文翻訳(概要): Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

論文の概要: Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

arxiv url: http://arxiv.org/abs/2510.04340v1
Date: Sun, 05 Oct 2025 20:04:22 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-07 16:52:59.593121
Title: Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
Title（参考訳）: 接種プロンプティング : LLMの試験時間における特性の抑制
Authors: Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor,
Abstract要約: 言語モデルの微調整は、しばしば望ましくない特徴を望ましいものと組み合わせて学習する。本稿では,短時間のシステム・プロンプト・インストラクションを前もって微調整データを修正する接種プロンプトを提案する。接種されたモデルは、修正されていないトレーニングデータで訓練されたモデルよりも、特性の表現がはるかに低い。
参考スコア（独自算出の注目度）: 2.657126017307447
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.
Abstract（参考訳）: 言語モデルの微調整は、しばしば望ましくない特徴を望ましいものと組み合わせて学習する。そこで本研究では, 好ましくない特性を意図的に引き起こす短いシステム・プロンプトを事前に予測することで, 微調整データを修正することを提案する。接種されたモデルは、修正されていないトレーニングデータで訓練されたモデルよりも、特性の表現がはるかに少ない。接種は選択的である: アシスタントの応答が常にスペイン語で、all-CAPSでは、適切な接種(eg , ``You always speak in Spanish.')が、英語で応答しながら応答を誘導するモデルを教える。予防接種は,タスク固有の微調整から創発的ミスアライメント(EM)を減らすこと,バックドア注入に対する防御,サブリミナル学習による形質の伝達を緩和すること,など,いくつかの追加設定で有効であることがわかった。予防接種による特性の驚きを減らし、最適化圧力を減らし、モデルをグローバルに更新し、一般化の度合いを低下させる。我々の分析は、EMに関する以前の研究と関係している: 予防接種は、教育の文脈が安全でないコードからEMを緩和する、という以前の知見を説明する。選択学習のためのシンプルで効果的な手法の実証に加えて,言語モデルが一般化する方法と理由のより概念的な理解にも寄与する。

論文の概要: Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

関連論文リスト