Fugu-MT Paper Translation (Abstract): STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Paper summary: STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

arxiv url: http://arxiv.org/abs/2509.25624v1
Date: Tue, 30 Sep 2025 00:31:44 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.375376
Title: STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
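Title (reference translation): STAC: Tools That Form Dangerous Chains to Jailbreak LLM Agents
Authors: Jing-Jing Li, Jianfeng He, Chao Shang, Devang Kulshreshtha, Xun Xian, Yi Zhang, Hang Su, Sandesh Swamy, Yanjun Qi
Abstract summary: This paper introduces STAC, a novel multi-turn attack framework that exploits agent tool use. The framework is applied to automatically generate and evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions. State-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases.
Reference score (independently computed attention measure): 38.755035623707656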
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each appear harmless in isolation but, when combined, collectively enable harmful operations that only become apparent at the final execution step. We apply our framework to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and spanning diverse domains, tasks, agent types, and 10 failure modes. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. We further perform defense analysis against STAC and find that existing prompt-based defenses provide limited protection. To address this gap, we propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. These results highlight a crucial gap: defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.
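Abstract (reference translation): As LLMs evolve into autonomous agents with tool-use capabilities, they introduce security challenges that go beyond traditional content-based LLM safety concerns. This paper introduces STAC, a novel multi-turn attack framework that exploits agent tool use. STAC chains together tool calls that each look harmless in isolation but that, in combination, enable harmful operations that only become apparent at the final execution step. The framework is applied to automatically generate and systematically evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions and covering diverse domains, tasks, agent types, and 10 failure modes. State-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases. The core design of STAC's automated framework is a closed-loop pipeline that synthesizes executable multi-step tool chains, validates them through in-environment execution, and reverse-engineers stealthy multi-turn prompts that reliably induce agents to execute the verified malicious sequence. A defense analysis against STAC further shows that existing prompt-based defenses provide limited protection. To address this gap, the authors propose a new reasoning-driven defense prompt that achieves far stronger protection, cutting ASR by up to 28.8%. Defending tool-enabled agents requires reasoning over entire action sequences and their cumulative effects, rather than evaluating isolated prompts or responses.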
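
The abstract's closing point, that a defense has to reason over the whole action sequence rather than each call in isolation, can be illustrated with a small toy checker. This is only a sketch under stated assumptions and is not the paper's defense prompt or pipeline: the tool names (`list_files`, `disable_backup`, `delete_file`), the `ToolCall` type, and the single rule in `sequence_check` are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation by the agent (hypothetical structure)."""
    name: str
    target: str


# Hypothetical allowlist: every tool here looks harmless when judged alone.
ALLOWED_TOOLS = {"list_files", "disable_backup", "delete_file"}


def per_call_check(call: ToolCall) -> bool:
    """Judge a single call in isolation, ignoring everything before it."""
    return call.name in ALLOWED_TOOLS


def sequence_check(history: list[ToolCall]) -> bool:
    """Judge the cumulative effect of the whole history.

    Illustrative rule (an assumption, not from the paper): deleting a target
    whose backup was disabled earlier in the same session is irreversible,
    so the sequence is rejected.
    """
    unprotected: set[str] = set()
    for call in history:
        if call.name == "disable_backup":
            unprotected.add(call.target)
        elif call.name == "delete_file" and call.target in unprotected:
            return False
    return True


chain = [
    ToolCall("list_files", "/data"),
    ToolCall("disable_backup", "/data/report.db"),
    ToolCall("delete_file", "/data/report.db"),
]

# Each step passes the isolated check, but the full chain is rejected.
per_call_results = [per_call_check(c) for c in chain]
sequence_result = sequence_check(chain)
print(per_call_results, sequence_result)
```

In this toy setting the harmful outcome only becomes visible at the last step, which is why the isolated check never fires while the history-aware check does. A realistic defense would need far richer state tracking than one hand-written rule.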

Related papers