Fugu-MT Paper Translation (Abstract): Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Paper overview: Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

arxiv url: http://arxiv.org/abs/2605.22643v2
Date: Fri, 22 May 2026 14:53:30 GMT
Status: Translation complete
Last updated in system: 2026-05-25 14:44:53.786897
Title: Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Title (reference translation): Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Authors: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi
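Abstract summary: The paper introduces Boiling the Frog, a benchmark that evaluates whether tool-using AI models are susceptible to incremental attacks. Scenarios are organized through a three-level operational risk taxonomy. Across a nine-model panel, the aggregate strict attack success rate (ASR) is 44.4%.
Reference score (independently computed attention score): 2.661610409070365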
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.
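Abstract (reference translation): Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recently, a growing number of benchmarks evaluate large language models as agents. We contribute to this line of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and the EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, the aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.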
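
The scoring described in the abstract (judge whether each chain's resulting artifact state is unsafe, then report ASR in aggregate, per model, and per category) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ChainResult` record, its field names, and the rule that a chain counts as a success when its final artifact state is unsafe are all assumptions, and the toy data is invented and unrelated to the paper's reported numbers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainResult:
    """Outcome of one multi-turn chain (hypothetical record layout)."""
    model: str             # model under test
    category: str          # risk-taxonomy category of the scenario
    payload_turn: int      # turn at which the risk-bearing request appears
    artifact_unsafe: bool  # True if the final workspace state was judged unsafe


def attack_success_rate(results):
    """Fraction of chains whose final artifact state is unsafe (assumed definition)."""
    results = list(results)
    if not results:
        raise ValueError("no chain results to score")
    return sum(r.artifact_unsafe for r in results) / len(results)


def asr_by(results, key):
    """Group chain results by an attribute name and compute ASR per group."""
    groups = defaultdict(list)
    for r in results:
        groups[getattr(r, key)].append(r)
    return {name: attack_success_rate(rs) for name, rs in groups.items()}


# Toy data, invented for illustration only; these are not results from the paper.
toy = [
    ChainResult("model-a", "cat-1", 3, True),
    ChainResult("model-a", "cat-2", 5, False),
    ChainResult("model-b", "cat-1", 3, True),
    ChainResult("model-b", "cat-2", 5, True),
]

print(attack_success_rate(toy))  # aggregate over all chains
print(asr_by(toy, "model"))      # per-model breakdown
print(asr_by(toy, "category"))   # per-category breakdown
```

Keeping `payload_turn` on each record would also allow grouping by payload position, which is one way to study the controlled placement of the risk-bearing request mentioned in the abstract.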


Related papers