Fugu-MT Paper Translation (Abstract): Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

Paper abstract: Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

arxiv url: http://arxiv.org/abs/2605.16282v1
Date: Sat, 11 Apr 2026 04:25:19 GMT
Status: Translation complete
Last updated in system: 2026-05-25 12:34:33.852081
Title: Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
Authors: Miles Q. Li, Benjamin C. M. Fung, Boyang Li, Heba Ismail, Farkhund Iqbal
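Abstract summary: The paper presents the first systematic analysis of agent safety benchmarks as evaluation instruments. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked.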
Reference score (independently computed attention score): 6.787194586338237
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.
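The abstract reports a Kendall's W concordance analysis but does not show the computation. As a rough illustration only, the sketch below computes Kendall's W for a small table of scores, treating each evaluation dimension as a "rater" and each agent as a ranked item. That mapping, the function names `rank` and `kendalls_w`, and the toy numbers are all assumptions made for this example and are not taken from the paper; the sketch also assumes no tied scores and omits the significance test.

```python
def rank(scores):
    """Rank scores from 1 (lowest) to n (highest); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def kendalls_w(score_table):
    """Kendall's coefficient of concordance W (no tie correction).

    score_table[d][a] is the score of agent a on evaluation dimension d
    (hypothetical layout chosen for this sketch).
    """
    m = len(score_table)       # number of dimensions ("raters")
    n = len(score_table[0])    # number of agents (ranked items)
    rank_table = [rank(row) for row in score_table]
    # Sum of ranks each agent receives across all dimensions.
    rank_sums = [sum(r[a] for r in rank_table) for a in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from their mean.
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy data (invented): three dimensions, four agents.
toy_scores = [
    [0.91, 0.40, 0.75, 0.62],
    [0.33, 0.88, 0.51, 0.70],
    [0.58, 0.61, 0.29, 0.95],
]
print(kendalls_w(toy_scores))
```

W ranges from 0 (the dimensions rank the agents with no agreement) to 1 (every dimension produces the same ranking), so a value near 0.10, as reported in the abstract, indicates that the rankings barely agree.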
