EVMbench: Evaluating AI Agents on Smart Contract Security
- URL: http://arxiv.org/abs/2603.04915v1
- Date: Thu, 05 Mar 2026 07:59:14 GMT
- Title: EVMbench: Evaluating AI Agents on Smart Contract Security
- Authors: Justin Wang, Andreas Bigger, Xiaohai Xu, Justin W. Lin, Andy Applebaum, Tejal Patwardhan, Alpin Yukseloglu, Olivia Watkins
- Abstract summary: EVMbench is an evaluation that measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances.
- Score: 9.254733807577242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Smart contracts on public blockchains now manage large amounts of value, and vulnerabilities in these systems can lead to substantial losses. As AI agents become more capable at reading, writing, and running code, it is natural to ask how well they can already navigate this landscape, both in ways that improve security and in ways that might increase risk. We introduce EVMbench, an evaluation that measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. EVMbench draws on 117 curated vulnerabilities from 40 repositories and, in the most realistic setting, uses programmatic grading based on tests and blockchain state under a local Ethereum execution environment. We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances. We release code, tasks, and tooling to support continued measurement of these capabilities and future work on security.
Related papers
- OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows [77.95511352806261]
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. We propose OS-Sentinel, a novel hybrid safety detection framework that combines a Formal Verifier for detecting explicit system-level violations with a Contextual Judge for assessing contextual risks and agent actions.
arXiv Detail & Related papers (2025-10-28T13:22:39Z)
- RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents [70.24175620901538]
Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters. Current static safety benchmarks and red-teaming tools are inadequate for identifying emerging real-world risky scenarios. We propose RedCodeAgent, the first automated red-teaming agent designed to systematically uncover vulnerabilities in diverse code agents.
arXiv Detail & Related papers (2025-10-02T22:59:06Z)
- Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models. Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z)
- OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z)
- Decompiling Smart Contracts with a Large Language Model [51.49197239479266]
Although Etherscan lists 78,047,845 smart contracts deployed on Ethereum (as of May 26, 2025), a mere 767,520 (< 1%) are open source. This opacity necessitates the automated semantic analysis of on-chain smart contract bytecode. We introduce a pioneering decompilation pipeline that transforms bytecode into human-readable and semantically faithful Solidity code.
arXiv Detail & Related papers (2025-06-24T13:42:59Z)
- AI-Driven Vulnerability Analysis in Smart Contracts: Trends, Challenges and Future Directions [0.2797210504706914]
Vulnerabilities such as numerical overflows, reentrancy attacks, and improper access permissions have led to the loss of millions of dollars. Traditional smart contract auditing techniques face limitations in scalability, automation, and adaptability to evolving development patterns. This paper examines novel AI-driven techniques for vulnerability detection in smart contracts, focusing on machine learning, deep learning, graph neural networks, and transformer-based models.
arXiv Detail & Related papers (2025-06-07T09:44:26Z)
- A Comprehensive Study of Exploitable Patterns in Smart Contracts: From Vulnerability to Defense [1.1138859624936408]
Vulnerabilities within smart contracts not only undermine the security of individual applications but also pose significant risks to the broader blockchain ecosystem. This paper provides a comprehensive analysis of key security risks in smart contracts, specifically those written in Solidity and executed on the Ethereum Virtual Machine. We focus on two prevalent and critical types (reentrancy and integer overflow) by examining their underlying mechanisms, replicating attack scenarios, and assessing effective countermeasures.
arXiv Detail & Related papers (2025-04-30T10:00:36Z)
- Insecurity Through Obscurity: Veiled Vulnerabilities in Closed-Source Contracts [11.609699771118116]
We present SKANF, a novel bytecode analysis tool tailored for closed-source and obfuscated contracts. SKANF combines control-flow deobfuscation, symbolic execution, and concolic execution based on historical transactions to identify and exploit asset management vulnerabilities. Our evaluation on real-world Maximal Extractable Value (MEV) bots reveals that SKANF detects vulnerabilities in 1,030 contracts and successfully generates exploits for 394 of them, with potential losses of $10.6M.
arXiv Detail & Related papers (2025-04-18T01:22:58Z)
- Vulnerability anti-patterns in Solidity: Increasing smart contracts security by reducing false alarms [0.0]
We show how integrating and extending current analyses is not only feasible, but also a next logical step in smart-contract security.
We propose light-weight static checks on the morphology and dynamics of Solidity code, stemming from a developer-centric notion of vulnerability.
arXiv Detail & Related papers (2024-10-22T17:21:28Z)
- An Automated Vulnerability Detection Framework for Smart Contracts [18.758795474791427]
We propose a framework to automatically detect vulnerabilities in smart contracts on the blockchain.
More specifically, we first apply novel feature vector generation techniques to the bytecode of smart contracts.
Next, the collected vectors are fed into our novel metric learning-based deep neural network (DNN) to get the detection result.
arXiv Detail & Related papers (2023-01-20T23:16:04Z)
- ESCORT: Ethereum Smart COntRacTs Vulnerability Detection using Deep Neural Network and Transfer Learning [80.85273827468063]
Existing machine learning-based vulnerability detection methods are limited and only inspect whether the smart contract is vulnerable.
We propose ESCORT, the first Deep Neural Network (DNN)-based vulnerability detection framework for smart contracts.
We show that ESCORT achieves an average F1-score of 95% on six vulnerability types and the detection time is 0.02 seconds per contract.
arXiv Detail & Related papers (2021-03-23T15:04:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.