Fugu-MT paper translation (abstract): Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs

Paper overview: Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs

arxiv url: http://arxiv.org/abs/2606.11648v1
Date: Wed, 10 Jun 2026 04:26:57 GMT
Status: translation complete
Last updated in system: 2026-06-11 16:42:38.291697
Title: Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs
Title (reference translation): Dummy Backdoor as a Defense: Removing Unknown Backdoors via Common Internal Mechanisms for Generative LLMs
Authors: Kazuki Iwahana, Masaru Matsubayashi, Takuma Koyama, Toshiki Shibahara, Kenichiro Omintato, Akira Ito
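Abstract summary: Backdoor attacks pose a serious threat to the safety and reliability of large language models. This paper proposes a simple but effective backdoor removal method based on internal mechanisms shared across different backdoors. The method substantially reduces the attack success rate of unknown backdoors while preserving model utility.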
Reference score (independently computed attention score): 1.4363317131844815
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Backdoor attacks pose a serious threat to the safety and reliability of Large Language Models (LLMs), as they cause models to behave normally on clean inputs while producing attacker-specified responses when hidden triggers are present. Removing such unknown backdoors is particularly challenging when the defender does not know the backdoor attack types or the internal mechanisms formed through backdoor training. In this work, we propose a simple but effective backdoor removal method based on shared internal mechanisms across different backdoors. First, we show that different backdoors with the same task (attack objective) induce similar trigger-activated changes in the internal activations. Motivated by this observation, our method intentionally embeds a backdoor with a known trigger (dummy backdoor) and then removes it through further fine-tuning on dummy-triggered inputs paired with clean responses. Since the dummy backdoor and the unknown backdoor can rely on shared internal mechanisms, removing the dummy backdoor also reduces the effect of the unknown backdoor. We evaluate our method on three backdoor attack types across multiple model families. Experimental results show that our method substantially reduces the attack success rate of the unknown backdoor while preserving model utility, outperforming representative existing defense methods in both backdoor removal effectiveness and utility preservation. These findings suggest that a defender-controllable backdoor can serve as a helpful proxy for mitigating unknown backdoors in generative LLMs.
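Abstract (reference translation): Backdoor attacks are a serious threat to the safety and reliability of large language models (LLMs). Removing such unknown backdoors is especially difficult when the defender knows neither the type of backdoor attack nor the internal mechanisms formed by backdoor training. This work proposes a simple and effective backdoor removal method built on internal mechanisms that are common to different backdoors. It first shows that different backdoors with the same task (attack objective) cause similar trigger-activated changes in the internal activations. Based on this observation, the method deliberately embeds a backdoor with a known trigger (dummy backdoor) and then removes it by further fine-tuning on dummy-triggered inputs paired with clean responses. Because the dummy backdoor and the unknown backdoor can depend on common internal mechanisms, removing the dummy backdoor also reduces the effect of the unknown backdoor. The method is evaluated on three types of backdoor attack across multiple model families. The experimental results show that it substantially reduces the attack success rate of the unknown backdoor while maintaining model utility, and that it outperforms existing defense methods in both backdoor removal effectiveness and utility preservation. These results suggest that a defender-controllable backdoor is useful for mitigating unknown backdoors in generative LLMs.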
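
The two fine-tuning phases described in the abstract can be illustrated with a minimal data-construction sketch. It is not the authors' implementation: the trigger string, the placeholder target response, the prefix placement of the trigger, the mixing of clean examples into the first phase, and all function names are assumptions made for illustration. The abstract only states that a backdoor with a known trigger is embedded and then removed by fine-tuning on dummy-triggered inputs paired with clean responses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    prompt: str
    response: str


# Hypothetical values; the abstract does not specify a trigger or a target.
DUMMY_TRIGGER = "[DUMMY-TRIGGER]"
DUMMY_TARGET = "<placeholder for a response with the assumed attack objective>"


def add_trigger(prompt: str, trigger: str = DUMMY_TRIGGER) -> str:
    # Assumption: the trigger is prepended to the prompt.
    return f"{trigger} {prompt}"


def build_embedding_set(clean: List[Example]) -> List[Example]:
    """Phase 1 data: embed the dummy backdoor.

    Triggered prompts are paired with the dummy target response.
    Keeping the clean pairs alongside them is an assumption, intended
    to preserve normal behavior on untriggered inputs.
    """
    poisoned = [Example(add_trigger(ex.prompt), DUMMY_TARGET) for ex in clean]
    return list(clean) + poisoned


def build_removal_set(clean: List[Example]) -> List[Example]:
    """Phase 2 data: remove the dummy backdoor.

    The same triggered prompts are paired with the original clean responses.
    """
    return [Example(add_trigger(ex.prompt), ex.response) for ex in clean]
```

Under these assumptions, a defender would fine-tune the suspect model on the output of `build_embedding_set` and then on the output of `build_removal_set`, using whatever fine-tuning pipeline is already in place. The abstract's observation concerns backdoors with the same task, so the dummy target would presumably need to match the suspected attack objective.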

Related papers