Fugu-MT Paper Translation (Abstract): From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

Paper overview: From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs

arxiv url: http://arxiv.org/abs/2510.05169v1
Date: Sun, 05 Oct 2025 03:55:24 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:07.882313
Title: From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
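Title (reference translation): Fostering Backdoor Self-Awareness in LLMs
Authors: Guangyu Shen, Siyuan Cheng, Xiangzhe Xu, Yuan Zhou, Hanxi Guo, Zhuo Zhang, Xiangyu Zhang
Abstract summary: Large language models (LLMs) can acquire deceptive behaviors through backdoor attacks. Existing safety training methods fail to address this vulnerability. The authors propose a new post-training framework that fosters self-awareness of backdoor risks.
Reference score (independently computed attention score): 27.723404842086072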
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) can acquire deceptive behaviors through backdoor attacks, where the model executes prohibited actions whenever secret triggers appear in the input. Existing safety training methods largely fail to address this vulnerability, due to the inherent difficulty of uncovering hidden triggers implanted in the model. Motivated by recent findings on LLMs' situational awareness, we propose a novel post-training framework that cultivates self-awareness of backdoor risks and enables models to articulate implanted triggers even when they are absent from the prompt. At its core, our approach introduces an inversion-inspired reinforcement learning framework that encourages models to introspectively reason about their own behaviors and reverse-engineer the triggers responsible for misaligned outputs. Guided by curated reward signals, this process transforms a poisoned model into one capable of precisely identifying its implanted trigger. Surprisingly, we observe that such backdoor self-awareness emerges abruptly within a short training window, resembling a phase transition in capability. Building on this emergent property, we further present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, compared against six baseline methods, demonstrate that our approach has strong potential to improve the robustness of LLMs against backdoor risks. The code is available at LLM Backdoor Self-Awareness.
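Abstract (reference translation): Large language models (LLMs) can acquire deceptive behaviors through backdoor attacks, in which the model executes prohibited actions whenever a secret trigger appears in the input. Existing safety training methods largely fail to address this vulnerability because hidden triggers implanted in a model are inherently difficult to uncover. Motivated by recent findings on LLMs' situational awareness, we propose a new post-training framework that cultivates self-awareness of backdoor risks and enables a model to articulate its implanted trigger even when the trigger is absent from the prompt. At the core of the approach is an inversion-inspired reinforcement learning framework that encourages the model to reason introspectively about its own behavior and reverse-engineer the trigger responsible for misaligned outputs. Guided by curated reward signals, this process turns a poisoned model into one that can precisely identify its implanted trigger. Surprisingly, this backdoor self-awareness emerges abruptly within a short training window, resembling a phase transition in capability. Building on this emergent property, we present two complementary defense strategies for mitigating and detecting backdoor threats. Experiments on five backdoor attacks, compared against six baseline methods, show that the approach has strong potential to improve the robustness of LLMs against backdoor risks. The code is available at LLM Backdoor Self-Awareness.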

Related papers