Fugu-MT 論文翻訳(概要): Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

論文の概要: Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

arxiv url: http://arxiv.org/abs/2506.16447v1
Date: Thu, 19 Jun 2025 16:30:56 GMT
ステータス: 翻訳完了
システム内更新日: 2025-06-23 19:00:05.164006
Title: Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
Title（参考訳）: 講演より前に - 大規模言語モデルのためのバックドア統一に対するブラックボックス防御を目指す
Authors: Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li,
Abstract要約: LLM(Large Language Models)に対するバックドアのアンアライメント攻撃は、隠れたトリガーを使用して、安全アライメントのステルスな妥協を可能にする。我々は,裏口LDMを不活性化させるために,推論中にトリガサンプルを検出するブラックボックスディフェンスBEATを紹介する。本手法は, サンプル依存目標の課題を, 反対の観点から解決する。
参考スコア（独自算出の注目度）: 17.839413035304748
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Backdoor unalignment attacks against Large Language Models (LLMs) enable the stealthy compromise of safety alignment using a hidden trigger while evading normal safety auditing. These attacks pose significant threats to the applications of LLMs in the real-world Large Language Model as a Service (LLMaaS) setting, where the deployed model is a fully black-box system that can only interact through text. Furthermore, the sample-dependent nature of the attack target exacerbates the threat. Instead of outputting a fixed label, the backdoored LLM follows the semantics of any malicious command with the hidden trigger, significantly expanding the target space. In this paper, we introduce BEAT, a black-box defense that detects triggered samples during inference to deactivate the backdoor. It is motivated by an intriguing observation (dubbed the probe concatenate effect), where concatenated triggered samples significantly reduce the refusal rate of the backdoored LLM towards a malicious probe, while non-triggered samples have little effect. Specifically, BEAT identifies whether an input is triggered by measuring the degree of distortion in the output distribution of the probe before and after concatenation with the input. Our method addresses the challenges of sample-dependent targets from an opposite perspective. It captures the impact of the trigger on the refusal signal (which is sample-independent) instead of sample-specific successful attack behaviors. It overcomes black-box access limitations by using multiple sampling to approximate the output distribution. Extensive experiments are conducted on various backdoor attacks and LLMs (including the closed-source GPT-3.5-turbo), verifying the effectiveness and efficiency of our defense. Besides, we also preliminarily verify that BEAT can effectively defend against popular jailbreak attacks, as they can be regarded as 'natural backdoors'.
Abstract（参考訳）: LLM(Large Language Models)に対するバックドアのアンアライメント攻撃は、通常の安全監査を回避しつつ、隠れたトリガーを使用して安全アライメントのステルスな妥協を可能にする。これらの攻撃は、実世界のLarge Language Model as a Service(LLMaaS)設定におけるLLMの応用に重大な脅威をもたらす。さらに、攻撃対象のサンプル依存性は脅威を悪化させる。固定ラベルを出力する代わりに、バックドアのLLMは、隠れたトリガーで悪意のあるコマンドのセマンティクスに従い、ターゲット空間を大幅に拡張する。本稿では,バックドアを非活性化するために,推論中にトリガサンプルを検出するブラックボックスディフェンスBEATを紹介する。興味をそそる観察(プローブ結合効果)によって動機付けられ、連結されたトリガー試料は悪質なプローブに対するバックドアLDMの拒絶率を著しく低下させるが、非トリガー試料は効果がほとんどない。具体的には、BEATは、入力と結合する前後のプローブの出力分布における歪みの度合いを測定することにより、入力がトリガーされるかどうかを特定する。本手法は, サンプル依存目標の課題を, 反対の観点から解決する。サンプル固有の攻撃行動ではなく、リファインダーがリファインダー信号(サンプルに依存しない)に与える影響をキャプチャする。出力分布を近似するために複数のサンプリングを使用することで、ブラックボックスアクセス制限を克服する。各種バックドア攻撃やLPM(GPT-3.5-turboを含む)による大規模な実験を行い,防衛の有効性と効果を検証した。また、BEATは「自然のバックドア」とみなすことができるため、一般的なジェイルブレイク攻撃に対して効果的に防御できるかどうかを事前に検証する。

論文の概要: Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

関連論文リスト