Fugu-MT Paper Translation (Summary): AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Paper summary: AutoBackdoor: Automating Backdoor Attacks via LLM Agents

arxiv url: http://arxiv.org/abs/2511.16709v1
Date: Thu, 20 Nov 2025 03:58:54 GMT
Status: Translation complete
Last updated in system: 2025-11-24 18:08:18.767089
Title: AutoBackdoor: Automating Backdoor Attacks via LLM Agents
Title (reference translation): AutoBackdoor: Automating Backdoor Attacks with LLM Agents
Authors: Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun
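Abstract summary: Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs). In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases.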
Reference score (independently computed attention score): 35.216857373810875
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including Bias Recommendation, Hallucination Injection, and Peer Review Manipulation, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.
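Abstract (reference translation): Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs). However, existing methods rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating the robustness of modern defenses. As AI agents become more capable, there is a growing need for rigorous, diverse, and scalable red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AutoBackdoor, a general framework for automating backdoor injection that covers trigger generation, poisoned data construction, and model fine-tuning through an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases. We evaluate AutoBackdoor under three realistic threat scenarios, Bias Recommendation, Hallucination Injection, and Peer Review Manipulation, to simulate a broad range of attacks. Experiments on open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, show that the method achieves over 90% attack success with only a small number of poisoned samples. More importantly, existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats. All code, datasets, and experimental configurations will be merged into the primary repository at https://github.com/bboylyg/BackdoorLLM.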

Related papers