Related papers: LLMs as verification oracles for Solidity

URL: http://arxiv.org/abs/2509.19153v1
Date: Tue, 23 Sep 2025 15:32:13 GMT
Title: LLMs as verification oracles for Solidity
Authors: Massimo Bartoletti, Enrico Lipparini, Livio Pompianu
Abstract summary: This paper provides the first systematic evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role. We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios. Our study suggests a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.
Score: 1.3887048755037537
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Ensuring the correctness of smart contracts is critical, as even subtle flaws can lead to severe financial losses. While bug detection tools able to spot common vulnerability patterns can serve as a first line of defense, most real-world exploits and losses stem from errors in the contract business logic. Formal verification tools such as SolCMC and the Certora Prover address this challenge, but their impact remains limited by steep learning curves and restricted specification languages. Recent works have begun to explore the use of large language models (LLMs) for security-related tasks such as vulnerability detection and test generation. Yet, a fundamental question remains open: can LLMs serve as verification oracles, capable of reasoning about arbitrary contract-specific properties? In this paper, we provide the first systematic evaluation of GPT-5, a state-of-the-art reasoning LLM, in this role. We benchmark its performance on a large dataset of verification tasks, compare its outputs against those of established formal verification tools, and assess its practical effectiveness in real-world auditing scenarios. Our study combines quantitative metrics with qualitative analysis, and shows that recent reasoning-oriented LLMs can be surprisingly effective as verification oracles, suggesting a new frontier in the convergence of AI and formal methods for secure smart contract development and auditing.

Related papers

Measuring what Matters: Construct Validity in Large Language Model Benchmarks [103.53142193393931]
Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. We conduct a systematic review of 445 benchmarks from leading conferences in natural language processing and machine learning. We find patterns related to the measured phenomena, tasks, and scoring metrics which undermine the validity of the resulting claims.
arXiv Detail & Related papers (2025-11-03T17:39:40Z)
Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities [15.35489310097019]
CLAUSE is a first-of-its-kind benchmark designed to evaluate the fragility of an LLM's legal reasoning. Our work outlines a path to identify and correct such reasoning failures in legal AI.
arXiv Detail & Related papers (2025-11-01T00:51:21Z)
ParaVul: A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection [43.41293570032631]
ParaVul is a retrieval-augmented framework to improve the reliability and accuracy of smart contract vulnerability detection. We develop Sparse Low-Rank Adaptation (SLoRA) for LLM fine-tuning. We construct a vulnerability contract dataset and develop a hybrid Retrieval-Augmented Generation (RAG) system.
arXiv Detail & Related papers (2025-10-20T03:23:41Z)
The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives [8.030821324147515]
Inverse Reinforcement Learning can infer reward functions from behaviour. Existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task. This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification.
arXiv Detail & Related papers (2025-10-07T16:25:14Z)
Validating Solidity Code Defects using Symbolic and Concrete Execution powered by Large Language Models [0.0]
This paper introduces a novel detection pipeline that integrates custom Slither-based detectors, Large Language Models (LLMs), Kontrol, and Forge. Our approach is designed to reliably detect defects and generate proofs.
arXiv Detail & Related papers (2025-09-16T12:46:11Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge. Users pay for services based on advertised model capabilities. Providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors [9.309745288471374]
Security code review is a time-consuming and labor-intensive process. Existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges.
arXiv Detail & Related papers (2024-01-29T17:13:44Z)
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets. Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets. We find that advanced prompting strategies that involve step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average).
arXiv Detail & Related papers (2023-11-16T13:17:20Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives [8.524720028421447]
This paper provides a systematic analysis of the opportunities, challenges, and potential solutions of harnessing Large Language Models (LLMs) such as GPT-4. Generating more answers with higher randomness largely boosts the likelihood of producing a correct answer but inevitably leads to a higher number of false positives. We propose an adversarial framework dubbed GPTLens that breaks the conventional one-stage detection into two synergistic stages: generation and discrimination.
arXiv Detail & Related papers (2023-10-02T12:37:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.