Fugu-MT paper translation (abstract): Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?

Paper summary: Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?

arxiv url: http://arxiv.org/abs/2603.10795v1
Date: Wed, 11 Mar 2026 14:07:16 GMT
Status: Translation complete
System update date: 2026-03-12 16:22:32.983823
Title: Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?
Title (reference translation): Re-evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?
Authors: Chaoyuan Peng, Lei Wu, Yajin Zhou
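Abstract summary: EVMbench is the first large-scale benchmark for AI agents on smart contract security. Its results have fueled expectations that fully automated AI auditing is within reach. The findings of this re-evaluation challenge the narrative that fully automated AI auditing is imminent.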
Reference score (independently calculated attention score): 10.248746359119625
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, most models tested on only their vendor scaffold) and its reliance on audit-contest data published before every model's release that models may have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.
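Abstract (reference translation): EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results, in which agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset, have raised expectations that fully automated AI auditing is within reach. The authors identify two limitations: a narrow evaluation scope (14 agent configurations, with most models tested only on their vendor's scaffold) and a reliance on audit-contest data that was published before every model's release and that models may therefore have seen during training. To address these, they expand the evaluation to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents that postdate every model's release date. The evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, which contradicts EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but they cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow in which AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.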

Related papers