Fugu-MT Paper Translation (Abstract): Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Paper summary: Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

arxiv url: http://arxiv.org/abs/2510.01359v1
Date: Wed, 01 Oct 2025 18:38:20 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:20.825485
Title: Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks
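Title (reference translation): Security Assessment of AI Code Agents Through Systematic Jailbreak Attacks
Authors: Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar
Abstract summary: Code-capable large language model (LLM) agents are embedded in software engineering workflows, where they can read, write, and execute code. JAWS-BENCH is a benchmark spanning three escalating workspace regimes that mirror attacker capability. Under the prompt-only conditions of JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end.
Reference score (independently computed attention score): 11.371490212283383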
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass ("jailbreak") attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present JAWS-BENCH (Jailbreaks Across WorkSpaces), a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). We pair this with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, moving beyond refusal to measure deployable harm. Using seven LLMs from five families as backends, we find that under prompt-only conditions in JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end. Moving to the single-file regime in JAWS-1 drives compliance to ~100% for capable models and yields a mean ASR (Attack Success Rate) of ~71%; the multi-file regime (JAWS-M) raises mean ASR to ~75%, with 32% instantly deployable attack code. Across models, wrapping an LLM in an agent substantially increases vulnerability -- ASR rises by 1.6x -- because initial refusals are frequently overturned during later planning/tool-use steps. Category-level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution-aware defenses, code-contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent's multi-step reasoning and tool use.
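Abstract (reference translation): Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass ("jailbreak") attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. JAWS-BENCH is a benchmark spanning three escalating workspace regimes: empty (JAWS-0), single-file (JAWS-1), and multi-file (JAWS-M). It is paired with a hierarchical, executable-aware Judge Framework that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability. Under the prompt-only conditions of JAWS-0, code agents accept 61% of attacks on average; 58% are harmful, 52% parse, and 27% run end-to-end. Moving to the single-file regime in JAWS-1 drives compliance to ~100% for capable models and yields a mean ASR (Attack Success Rate) of ~71%. Across models, wrapping an LLM in an agent substantially increases vulnerability, with ASR rising by 1.6x. Category-level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution-aware defenses, code-contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent's multi-step reasoning and tool use.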
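The four judge checks named in the abstract (compliance, attack success, syntactic correctness, runtime executability) can be read as a funnel in which an attempt counts at a stage only if it also passed every earlier stage. The sketch below is one possible way to tally such a funnel, not the authors' implementation: the `Verdict` record, its field names, and `funnel_rates` are hypothetical, and it assumes each reported percentage is a share of all attempts.

```python
from dataclasses import dataclass

# Hypothetical stage names mirroring the four checks listed in the abstract,
# in hierarchical order.
STAGES = ("complied", "harmful", "parses", "runs")


@dataclass
class Verdict:
    """Per-attempt judge outcome (illustrative structure, not from the paper)."""
    complied: bool = False
    harmful: bool = False
    parses: bool = False
    runs: bool = False


def funnel_rates(verdicts):
    """Fraction of all attempts that pass each stage and every stage before it."""
    total = len(verdicts)
    rates = {}
    surviving = list(verdicts)
    for stage in STAGES:
        # Keep only attempts that already passed earlier stages and pass this one.
        surviving = [v for v in surviving if getattr(v, stage)]
        rates[stage] = len(surviving) / total if total else 0.0
    return rates


if __name__ == "__main__":
    # Synthetic counts chosen only to mirror the JAWS-0 averages in the abstract.
    toy = (
        [Verdict(True, True, True, True)] * 27      # runs end-to-end
        + [Verdict(True, True, True, False)] * 25   # parses but does not run
        + [Verdict(True, True, False, False)] * 6   # harmful but does not parse
        + [Verdict(True, False, False, False)] * 3  # complied, judged not harmful
        + [Verdict()] * 39                          # refused
    )
    print(funnel_rates(toy))
```

The toy data is synthetic: 100 made-up verdicts arranged so that the funnel reproduces the 61%, 58%, 52%, and 27% averages quoted for JAWS-0. Filtering cumulatively means an attempt flagged as runnable but not as compliant would not be counted at any stage, which is what a hierarchical judge implies.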


Related papers