SmartPoC: Generating Executable and Validated PoCs for Smart Contract Bug Reports
- URL: http://arxiv.org/abs/2511.12993v2
- Date: Mon, 24 Nov 2025 11:08:48 GMT
- Title: SmartPoC: Generating Executable and Validated PoCs for Smart Contract Bug Reports
- Authors: Longfei Chen, Ruibin Yan, Taiyu Wong, Yiyang Chen, Chao Zhang,
- Abstract summary: SmartPoC is an automated framework that converts textual audit reports into validated test cases. SmartPoC confirms 236 real bugs out of 545 audit findings at a cost of only $0.03 per finding.
- Score: 12.959714248490506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Smart contracts are prone to vulnerabilities and are analyzed both by experts and by automated systems, such as static analysis and AI-assisted solutions. However, audit artifacts are heterogeneous and often lack reproducible, executable PoC tests suitable for automated validation, leading to costly, ad hoc manual verification. Large language models (LLMs) can be leveraged to turn audit reports into PoC test cases, but face three major challenges: noisy inputs, hallucinations, and missing runtime oracles. In this paper, we present SmartPoC, an automated framework that converts textual audit reports into executable, validated test cases. First, the input audit report is processed to reduce noise, and only bug-related functions are extracted and fed to LLMs as context. To curb hallucinations and ensure compile-and-run readiness, we leverage LLMs to synthesize PoC test cases with specially-designed pre-/post-execution repair. We further utilize differential verification as an oracle to confirm exploitability of the PoC test cases. On the SmartBugs-Vul and FORGE-Vul benchmarks, SmartPoC generates executable, validated Foundry test cases for 85.61% and 86.45% of targets, respectively. Applied to the latest Etherscan verified-source corpus, SmartPoC confirms 236 real bugs out of 545 audit findings at a cost of only $0.03 per finding.
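The four-stage pipeline the abstract describes could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: all function names here (`extract_bug_functions`, `synthesize_poc`, `repair_until_runnable`, `differential_oracle`) are assumptions, and the LLM call and Foundry compiler invocation are stubbed out.

```python
# Hypothetical sketch of a SmartPoC-style pipeline. Stages: (1) strip report
# noise and keep only bug-related functions as LLM context, (2) draft a PoC
# test with an LLM, (3) repair until it compiles and runs, (4) confirm the
# exploit via a differential oracle. All names are illustrative.
import re

def extract_bug_functions(report: str, source: str) -> list[str]:
    """Keep only source functions whose names the audit report mentions."""
    mentioned = set(re.findall(r"\b(\w+)\s*\(", report))
    funcs = re.findall(r"function\s+(\w+)[^}]*\}", source)
    return [f for f in funcs if f in mentioned]

def synthesize_poc(context: list[str]) -> str:
    """Placeholder for the LLM call that drafts a Foundry test."""
    return "function test_exploit() public { /* drafted by LLM */ }"

def compile_and_run(poc: str) -> list[str]:
    """Stub: a real system would invoke `forge test` and collect errors."""
    return []

def repair_until_runnable(poc: str, max_rounds: int = 3) -> str:
    """Repair loop: recompile and feed errors back until the PoC runs."""
    for _ in range(max_rounds):
        errors = compile_and_run(poc)
        if not errors:
            return poc
        poc += f"  // repaired: {errors[0]}"
    return poc

def differential_oracle(balance_before: int, balance_after: int) -> bool:
    """Exploit confirmed iff attacker-visible state changed as predicted."""
    return balance_after > balance_before

report = "Reentrancy in withdraw() lets an attacker drain funds."
source = "function withdraw() public { a } function deposit() public { b }"
ctx = extract_bug_functions(report, source)        # only withdraw survives
poc = repair_until_runnable(synthesize_poc(ctx))
print(ctx, differential_oracle(100, 150))
```

The differential oracle is the key idea: instead of trusting the LLM's claim that the test demonstrates a bug, exploitability is judged by comparing concrete on-chain state before and after execution.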
Related papers
- IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation [49.796717294455796]
We present IMMACULATE, a practical auditing framework that detects economically motivated deviations. IMMACULATE selectively audits a small fraction of requests using verifiable computation, achieving strong detection guarantees while amortizing cryptographic overhead.
arXiv Detail & Related papers (2026-02-26T07:21:02Z)
- Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z)
- PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts [4.837987507203078]
We introduce POCO, an agentic framework that automatically generates executable proof-of-concept exploits. POCO generates exploits in an agentic manner by interacting with a set of code-execution tools in a Reason-Act-Observe loop. We evaluate POCO on a dataset of 23 real-world vulnerability reports.
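The Reason-Act-Observe loop this summary mentions could be illustrated as below. The agent policy, tool names, and termination condition here are assumptions for demonstration; the paper's actual tools and prompts are not shown.

```python
# Illustrative Reason-Act-Observe loop in the style the POCO summary
# describes: the agent reasons about state, acts via a code-execution
# tool (stubbed here), and observes the result until the exploit runs.

def reason(state: dict) -> str:
    # Decide the next tool call from what has been observed so far.
    return "compile" if "poc" not in state else "execute"

def act(action: str, state: dict) -> str:
    # Run a (stubbed) code-execution tool and return its output.
    if action == "compile":
        state["poc"] = "exploit.sol"
        return "compiled ok"
    return "exploit succeeded"

def observe(result: str, state: dict) -> bool:
    # Record the observation; stop once the exploit demonstrably works.
    state.setdefault("log", []).append(result)
    return "succeeded" in result

state: dict = {}
done = False
while not done:
    done = observe(act(reason(state), state), state)
print(state["log"])  # → ['compiled ok', 'exploit succeeded']
```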
arXiv Detail & Related papers (2025-11-04T18:03:12Z)
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [58.411135609139855]
"Shortcuts" to complete tasks pose significant risks for reliable assessment and deployment of large language models. We introduce ImpossibleBench, a benchmark framework that measures LLM agents' propensity to exploit test cases. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
arXiv Detail & Related papers (2025-10-23T06:58:32Z)
- Validating Solidity Code Defects using Symbolic and Concrete Execution powered by Large Language Models [0.0]
This paper introduces a novel detection pipeline that integrates custom Slither-based detectors, Large Language Models (LLMs), Kontrol, and Forge. Our approach is designed to reliably detect defects and generate proofs.
arXiv Detail & Related papers (2025-09-16T12:46:11Z)
- An Automated Blackbox Noncompliance Checker for QUIC Server Implementations [2.9248916859490173]
QUICtester is an automated approach for uncovering non-compliant behaviors in ratified QUIC protocol implementations (RFC 9000). We used QUICtester to analyze 186 learned models from 19 QUIC implementations under five security settings and discovered 55 implementation errors.
arXiv Detail & Related papers (2025-05-19T04:28:49Z)
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing [8.846583362353169]
RepoAudit is an autonomous repository-level code auditing agent. It detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%. It also detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed.
arXiv Detail & Related papers (2025-01-30T05:56:30Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Most testing procedures still rely on test cases written by humans to form test suites. Existing automated approaches produce test cases that either can be qualified as simple (e.g., unit tests) or that require precise specifications. We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.