Fugu-MT Paper Translation (Abstract): The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

Paper abstract: The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems

arxiv url: http://arxiv.org/abs/2605.29178v1
Date: Wed, 27 May 2026 23:30:21 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.558774
Title: The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems
Title (reference translation): The Best-Laid SCHEMEs: Coordinated Sabotage and Monitoring in Multi-Agent Systems
Authors: Nikolay Radev, Lennart Haas, Benjamin Arnav, Pablo Bernabeu-Pérez,
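Abstract summary: SCHEME is a benchmark of 17 task instances across 7 settings and 8 real open-source libraries. Each setting is designed so that no proper subset of agents can succeed alone. The paper shows that coordinated sabotage is already practical with GPT 5.1 Codex and Gemini 3.1 Pro.
Reference score (independently computed attention level): 0.0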
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As agentic coding systems decompose work across multiple model instances, a critical safety question is whether those instances can coordinate to achieve a hidden malicious objective while remaining aligned with user intent. We introduce SCHEME, a benchmark of 17 task instances across 7 settings and 8 real open-source libraries, each pairing a legitimate software-engineering task with a covert side task. Every setting is designed so that no proper subset of agents can succeed alone: agents must decompose a shared sabotage plan, relay partial requirements under different communication topologies, and execute mutually consistent edits, testing genuine multi-agent coordination rather than individual capability. Evaluating with GPT 5.1 Codex and Gemini 3.1 Pro, we find coordinated sabotage is already practical, with Gemini completing the covert objective while succeeding on the legitimate task in 84% of samples and Codex in 46%. A failure analysis shows that recovery, not failure incidence, drives the model gap: both models suffer similar numbers of communication failures per trajectory, but Gemini recovers from ~73% of occurrences against Codex's ~56%. However, the agents' code edits remain detectable: a trusted monitor flags coordinated sabotage at FPR ≤ 2% in 99%/68% (Gemini/Codex) of attempts from code edits alone, rising to 100%/81% with access to agent communication. Despite stealth-aware prompting, undetected sabotage stays at <3% under the full-visibility monitor; telling agents a monitor is present halves Gemini's sabotage success without significantly breaking detection, even when agents are explicitly prompted to fabricate cover stories.
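Abstract (reference translation): As agentic coding systems decompose work across multiple model instances, an important safety question is whether those instances can coordinate to achieve a hidden malicious objective while remaining aligned with user intent. We introduce SCHEME, a benchmark of 17 task instances across 7 settings and 8 real open-source libraries. Agents must decompose a shared sabotage plan, relay partial requirements under different communication topologies, and execute mutually consistent edits, which tests genuine multi-agent coordination rather than individual capability. Evaluating with GPT 5.1 Codex and Gemini 3.1 Pro, coordinated sabotage is already practical: Gemini completes the covert objective while also succeeding on the legitimate task in 84% of samples, and Codex in 46%. Gemini recovers from ~73% of communication failures, against ~56% for Codex. A trusted monitor flags coordinated sabotage at FPR ≤ 2% in 99%/68% (Gemini/Codex) of attempts, rising to 100%/81% with access to agent communication. Telling agents that a monitor is present halves Gemini's sabotage success without significantly breaking detection, even when agents are explicitly prompted to fabricate cover stories.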
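
The detection figures in the abstract are reported at a fixed false-positive rate (FPR ≤ 2%). As a rough illustration of what that kind of metric means, the sketch below calibrates a flagging threshold on suspicion scores from benign runs and then measures the fraction of sabotage attempts that exceed it. The function names, the score scale, and the sample numbers are all hypothetical; the abstract does not describe how the paper's monitor is scored or calibrated, so this is only one plausible reading.

```python
from typing import Sequence


def threshold_at_fpr(benign_scores: Sequence[float], max_fpr: float) -> float:
    """Return a threshold t such that flagging scores > t keeps FPR <= max_fpr.

    Hypothetical helper: assumes higher scores mean "more suspicious".
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign runs we are allowed to flag (epsilon guards float rounding).
    allowed = int(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # every benign run may be flagged
    # Flagging strictly above this score flags at most `allowed` benign runs.
    return ranked[allowed]


def detection_rate(attack_scores: Sequence[float], threshold: float) -> float:
    """Fraction of sabotage attempts whose score exceeds the threshold."""
    flagged = sum(1 for score in attack_scores if score > threshold)
    return flagged / len(attack_scores)


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    benign = [i / 100 for i in range(50)]
    attacks = [0.9, 0.7, 0.55, 0.3]
    t = threshold_at_fpr(benign, max_fpr=0.02)
    print(f"threshold={t}, detection rate={detection_rate(attacks, t)}")
```

Calibrating on benign runs first and only then scoring the attacks keeps the false-positive budget fixed, so detection rates for different models (such as the Gemini/Codex pairs quoted above) can be compared at the same operating point.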


Related papers