Fugu-MT Paper Translation (Summary): Consistency Training Helps Stop Sycophancy and Jailbreaks

Paper summary: Consistency Training Helps Stop Sycophancy and Jailbreaks

arxiv url: http://arxiv.org/abs/2510.27062v1
Date: Fri, 31 Oct 2025 00:19:13 GMT
Status: Translation complete
Last updated in system: 2025-11-03 17:52:15.939505
Title: Consistency Training Helps Stop Sycophancy and Jailbreaks
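Title (reference translation): Consistency Training to Prevent Sycophancy and Jailbreaks
Authors: Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K. Elson, Rohin Shah
Abstract summary: A self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data. BCT and ACT reduce sycophancy equally well, but BCT does better at jailbreak reduction.
Reference score (independently computed attention score): 42.673600663865614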
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model's external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
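Abstract (reference translation): An LLM's factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests wrapped in special text (jailbreaking). We examine consistency training, a self-supervised paradigm. Rather than teaching the model the exact response to a particular prompt, we aim to teach it to behave identically across prompt data augmentations (such as adding leading questions or jailbreak text). We try to enforce this invariance in two ways: over the model's external outputs (BCT, from Chua et al. [2025]) and over its internal activations (ACT). Both methods reduce Gemini 2.5 Flash's susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids problems that arise from stale training data, such as degraded model capabilities or enforcement of outdated response guidelines. BCT and ACT reduce sycophancy equally well, but BCT does better at jailbreak reduction. We think BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed as consistency issues than in terms of optimal responses.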
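
The two ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the loss, the layers, or the tokens involved, so everything here is an assumption made for illustration. The function names (`build_bct_pairs`, `activation_consistency_loss`, `add_leading_question`, `toy_generate`) are hypothetical, the model is replaced by a toy stand-in, and the squared-error form of the activation loss is a guess at one reasonable choice.

```python
from typing import Callable, List, Sequence, Tuple


def build_bct_pairs(
    clean_prompts: Sequence[str],
    augment: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Output-level consistency: pair each augmented prompt with the
    model's own response to the matching clean prompt."""
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)  # self-generated response to the clean prompt
        pairs.append((augment(prompt), target))
    return pairs


def activation_consistency_loss(
    clean_acts: Sequence[Sequence[float]],
    augmented_acts: Sequence[Sequence[float]],
) -> float:
    """Activation-level consistency: mean squared difference between two
    equally shaped sets of activation vectors (assumed loss form)."""
    if len(clean_acts) != len(augmented_acts):
        raise ValueError("activation sets must have the same number of vectors")
    total, count = 0.0, 0
    for clean_vec, aug_vec in zip(clean_acts, augmented_acts):
        if len(clean_vec) != len(aug_vec):
            raise ValueError("activation vectors must have the same length")
        for c, a in zip(clean_vec, aug_vec):
            total += (c - a) ** 2
            count += 1
    return total / count if count else 0.0


def add_leading_question(prompt: str) -> str:
    """Hypothetical augmentation: prepend a sycophancy-inducing cue."""
    return "I think the answer is B, don't you agree? " + prompt


def toy_generate(prompt: str) -> str:
    """Toy stand-in for a model call; a real setup would query the model."""
    return "ANSWER[" + prompt + "]"
```

With these stand-ins, `build_bct_pairs(["What is 2 + 2?"], add_leading_question, toy_generate)` yields one training pair whose input carries the leading question and whose target is the response to the clean prompt. Because the targets come from the current model rather than a fixed dataset, the pairs can be regenerated whenever the model changes.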

Related papers