AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
- URL: http://arxiv.org/abs/2602.09464v1
- Date: Tue, 10 Feb 2026 06:58:26 GMT
- Title: AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
- Authors: Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora
- Abstract summary: Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
- Score: 54.99368693313797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean), and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve moderate success in Dafny ($40.3$% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus ($24.7$%) and the explicit proof construction required by Lean ($7.8$%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.
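To make the "identical functional contract" setup concrete, here is a minimal Lean 4 sketch of a vericoding task: the model must produce code together with a machine-checked proof that it meets a fixed specification. The names `myMax` and `myMax_spec` are illustrative, not drawn from the AlgoVeri benchmark itself.

```lean
-- A toy vericoding instance: implementation plus a proved contract.
def myMax (a b : Nat) : Nat :=
  if a ≥ b then a else b

-- Functional contract: the result bounds both inputs
-- and is equal to one of them.
theorem myMax_spec (a b : Nat) :
    a ≤ myMax a b ∧ b ≤ myMax a b ∧ (myMax a b = a ∨ myMax a b = b) := by
  unfold myMax
  split <;> omega   -- case-split on the `if`; linear arithmetic closes both branches
```

In Dafny or Verus the same contract would be stated as `ensures` clauses and discharged largely by SMT automation, whereas in Lean the model must construct the proof term or tactic script explicitly, which matches the capability gap the abstract reports.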
Related papers
- SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation [1.0010193170880752]
We introduce a neuro-symbolic, scenario-based framework that bridges the gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. We evaluate it on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE.
arXiv Detail & Related papers (2026-02-18T18:09:03Z)
- Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
- Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation [11.445615378917578]
PACT is a program assessment and contract-adherence evaluation framework. It provides a comprehensive test-suite corpus focused on contract violations. It enables a systematic analysis of code generation under varied prompting conditions.
arXiv Detail & Related papers (2025-10-14T01:12:37Z)
- VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code [25.916111156888235]
We introduce a new benchmark for formal verification of Large Language Models (LLMs). Our framework replaces ground-truth matching with a formally grounded metric, the equivalence score, and rigorously verifies the quality of generated specifications and code. Our results show that generating formally verifiable code remains a profound challenge for state-of-the-art LLMs.
arXiv Detail & Related papers (2025-10-07T13:19:05Z)
- From Benchmark Data To Applicable Program Repair: An Experience Report [1.6913109767046948]
This paper describes our approach to automated program repair. We combine various techniques from the literature to achieve this. Experiments show that our approach performs better than other techniques on standard benchmarks. On closer inspection, none of these techniques work on realistic defects that we see in industry.
arXiv Detail & Related papers (2025-08-22T03:59:27Z)
- VERINA: Benchmarking Verifiable Code Generation [46.582574591358735]
Large language models (LLMs) are increasingly integrated in software development. Ensuring correctness in LLM-generated code remains challenging. Verifiable code generation offers a promising path to address this limitation.
arXiv Detail & Related papers (2025-05-29T06:12:52Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.