Fugu-MT Paper Translation (Abstract): Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Paper overview: Can LLMs Reliably Self-Report Adversarial Prefills, and How?

arxiv url: http://arxiv.org/abs/2606.23671v1
Date: Mon, 22 Jun 2026 17:56:30 GMT
Status: Translation complete
System update date: 2026-06-24 17:15:56.089308
Title: Can LLMs Reliably Self-Report Adversarial Prefills, and How?
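Authors: Quang Minh Nguyen, Uzair Ahmed, Taegyoon Kim
Abstract summary: Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. This study examines whether a model can reliably recognize that its own prior response was elicited by an adversarial prefill attack.
Reference score (independently computed attention metric): 9.80193616788089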
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs (3B to 70B) and four safety benchmarks, no model reliably recognizes its own compromised outputs, with models claiming intent on prefilled responses at an average rate of $27.3\%$. Introspective signal stems largely from safety- and refusal-related reasoning. Orthogonalizing models' weights against the refusal direction collapses the gap between claiming rates on prefilled and natural outputs to near zero, though the direction is not its unique mediator. The signal is also probe-dependent: framing the question as internal intention versus external tampering elicits qualitatively different responses on the same models. We test three LoRA finetuning methods (SFT, GRPO, DPO) on eight models from 3B to 27B; all three widen the intention-probe gap on every model from 8B to 27B, with method ranking varying by model. The intervention does not transfer to the tampering probe and counterintuitively raises attack success rate under adversarial prefill on most models, amounting to a partial mitigation. These findings outline mechanisms underpinning the observed introspective signals in safety contexts and highlight risks in the reliability of LLM self-reports.
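The abstract mentions orthogonalizing model weights against a refusal direction and measuring the gap between claiming rates on natural and prefilled outputs. The sketch below illustrates one common way such a step could look: a rank-one projection that removes a direction from the output space of a weight matrix, plus a simple rate difference. It is an illustration under stated assumptions, not the authors' implementation. The function names, the matrix layout (rows index the output dimension), and the sign convention of the gap (natural minus prefilled) are all hypothetical choices made for this example.

```python
import math


def normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def orthogonalize(weight, direction):
    """Project `direction` out of the output space of `weight`.

    weight:    list of rows with shape (d_out, d_in); its outputs live in
               the same space as `direction` (length d_out). Assumed layout.
    direction: candidate refusal direction (hypothetical input here).

    Returns W' = W - r r^T W with r the unit direction, so that
    r^T (W' x) = 0 for every input x.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: the component of each column of W along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract the rank-one term r (r^T W) from W.
    return [[weight[i][j] - r[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]


def claim_rate_gap(natural_claims, prefilled_claims):
    """Claiming-rate gap, assumed here to be natural minus prefilled.

    Each argument is a list of booleans: True when the model claimed it
    intended the response.
    """
    def rate(xs):
        return sum(xs) / len(xs)
    return rate(natural_claims) - rate(prefilled_claims)
```

A gap near zero under this convention would mean the model claims intent equally often on prefilled and natural outputs, which is the collapse the abstract describes after orthogonalization.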

Related papers