Fugu-MT Paper Translation (Abstract): Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Paper overview: Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

arxiv url: http://arxiv.org/abs/2604.21700v1
Date: Thu, 23 Apr 2026 14:08:53 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.583806
Title: Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers
Title (reference translation): Stealthy Backdoor Attacks against LLMs Using Natural Style Triggers
Authors: Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang, Haijun Wang, Ting Liu
Abstract summary: BadStyle is a complete backdoor attack framework and pipeline. The authors show that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness.
Reference score (independently computed attention score): 14.223585332498734
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.
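Abstract (reference translation): The growing application of large language models (LLMs) in safety-critical domains has raised urgent security concerns. Many recent studies have shown that backdoor attacks against LLMs are feasible. However, existing methods suffer from three major shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are actually delivered and activated. To address these gaps, we introduce BadStyle, a complete backdoor attack framework and pipeline. BadStyle uses an LLM as a poisoned sample generator to construct natural, stealthy poisoned samples. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments on seven LLMs, including the LLaMA, Phi, DeepSeek, and GPT series, show that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of about 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.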

Related papers