VeruSAGE: A Study of Agent-Based Verification for Rust Systems
- URL: http://arxiv.org/abs/2512.18436v1
- Date: Sat, 20 Dec 2025 17:22:52 GMT
- Title: VeruSAGE: A Study of Agent-Based Verification for Rust Systems
- Authors: Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel, Jacob R. Lorch, Shan Lu,
- Abstract summary: This paper offers a comprehensive study of large language models' capability to develop correctness proofs for system software written in Rust.<n>We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems.<n>The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench.
- Score: 3.268641886777319
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.
Related papers
- Specification and Detection of LLM Code Smells [3.53563608080816]
We introduce the concept of LLM code smells and formalize five problematic coding practices related to LLM inference in software systems.<n>We extend the detection tool SpecDetect4AI to cover the newly defined LLM code smells and use it to validate their prevalence in a dataset of 200 open-source LLM systems.
arXiv Detail & Related papers (2025-12-19T19:24:56Z) - LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests [7.6818904666624395]
This paper proposes a dual-LLM system and experiments with the usage of LLMs for the generation of compiler tests.<n>It is evident that LLMs possess the promising potential to generate quality compiler tests and verify them automatically.
arXiv Detail & Related papers (2025-07-29T02:34:28Z) - Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System [51.04535721779685]
This paper aims to demonstrate the potential and strengths of open-source collectives.<n>We propose SMACS, a scalable multi-agent collaboration system (MACS) framework with high performance.<n> Experiments on eight mainstream benchmarks validate the effectiveness of our SMACS.
arXiv Detail & Related papers (2025-07-14T16:17:11Z) - A Self-Improving Coding Agent [23.44829720834145]
Large Language Models (LLMs) have spurred interest in deploying LLM agents to undertake tasks in the world.<n>We demonstrate that an agent system, equipped with basic coding tools, can autonomously edit itself, and thereby improve its performance on benchmark tasks.
arXiv Detail & Related papers (2025-04-21T16:58:18Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs.<n>We introduce LLM2, a novel framework that combines an LLM with a process-based verifier.<n>LLMs2 is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - LLMmap: Fingerprinting For Large Language Models [15.726286532500971]
With as few as 8 interactions, LLMmap can accurately identify 42 different LLM versions with over 95% accuracy.<n>We discuss potential mitigations and demonstrate that, against resourceful adversaries, effective countermeasures may be challenging or even unrealizable.
arXiv Detail & Related papers (2024-07-22T17:59:45Z) - BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.<n>Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.<n>We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - AgentBench: Evaluating LLMs as Agents [99.12825098528212]
Large Language Model (LLM) as agents has been widely acknowledged recently.<n>We present AgentBench, a benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.