Towards Automated Formal Verification of Backend Systems with LLMs
- URL: http://arxiv.org/abs/2506.10998v1
- Date: Sun, 13 Apr 2025 16:49:37 GMT
- Title: Towards Automated Formal Verification of Backend Systems with LLMs
- Authors: Kangping Xu, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
- Abstract summary: We propose a novel framework that leverages functional programming and type systems to translate backend code into formal Lean representations. Our pipeline automatically generates theorems that specify the intended behavior of APIs and database operations, and uses LLM-based provers to verify them. We evaluate our method on realistic backend systems and find that it can formally verify over 50% of the test requirements, which suggests that half of a testing engineer's workload can be automated.
- Score: 9.66648456498893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software testing plays a critical role in ensuring that systems behave as intended. However, existing automated testing approaches struggle to match the capabilities of human engineers due to key limitations such as test locality, lack of general reliability, and business logic blindness. In this work, we propose a novel framework that leverages functional programming and type systems to translate Scala backend code into formal Lean representations. Our pipeline automatically generates theorems that specify the intended behavior of APIs and database operations, and uses LLM-based provers to verify them. When a theorem is proved, the corresponding logic is guaranteed to be correct and no further testing is needed. If the negation of a theorem is proved instead, it confirms a bug. In cases where neither can be proved, human intervention is required. We evaluate our method on realistic backend systems and find that it can formally verify over 50% of the test requirements, which suggests that half of a testing engineer's workload can be automated. Additionally, with an average cost of only $2.19 per API, LLM-based verification is significantly more cost-effective than manual testing and can be scaled easily through parallel execution. Our results indicate a promising direction for scalable, AI-powered software testing, with the potential to greatly improve engineering productivity as models continue to advance.
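To make the pipeline concrete, the sketch below shows the general shape of such a specification in Lean 4. It is a minimal illustration, not the paper's actual encoding of Scala code: `Account`, `DB`, `deposit`, and the buggy variant are all invented here. A backend operation is modeled as a pure function over a database state; an auto-generated theorem states an intended invariant, and proving the negation of the spec on a buggy variant is what "confirms a bug" means in practice.

```lean
-- Minimal sketch (hypothetical names, not the paper's encoding of Scala code).
structure Account where
  id      : Nat
  balance : Nat

abbrev DB := List Account

-- A backend API modeled as a pure function from a database state to a new one.
def deposit (db : DB) (aid amt : Nat) : DB :=
  db.map fun a => if a.id = aid then { a with balance := a.balance + amt } else a

-- An auto-generated specification: deposits never create or drop accounts.
theorem deposit_preserves_count (db : DB) (aid amt : Nat) :
    (deposit db aid amt).length = db.length := by
  simp [deposit]

-- A buggy variant that deletes the account instead of crediting it.
def depositBuggy (db : DB) (aid amt : Nat) : DB :=
  db.filter fun a => a.id ≠ aid

-- Proving the negation of the spec on a concrete state confirms the bug.
example : (depositBuggy [⟨1, 100⟩] 1 5).length ≠ ([⟨1, 100⟩] : DB).length := by
  decide
```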
Related papers
- Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [68.00108157244952]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems, where generative models operate in formal language spaces (e.g., Dafny), enables automatic and mathematically provable verification of their reasoning processes and outcomes.
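As a hedged illustration of what operating in a formal language space buys, here is a Lean 4 stand-in for the Dafny-style setting described above (`absVal` is invented for this example and is not from the paper): the contract becomes a theorem that the proof assistant's kernel must accept, so verification of the model's output is automatic rather than reviewed by hand.

```lean
-- A Dafny-style `ensures` clause rendered as a Lean 4 theorem (illustrative only).
def absVal (x : Int) : Int := if x < 0 then -x else x

-- Postcondition: the result is never negative. The kernel checks this proof.
theorem absVal_nonneg (x : Int) : 0 ≤ absVal x := by
  unfold absVal
  split <;> omega
```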
arXiv Detail & Related papers (2025-07-22T08:13:01Z)
- Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs. This paper investigates how programmers use Large Language Models to complement their own skills. The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
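A rough sketch of how program-analysis feedback can be folded into a scalar reinforcement signal is shown below. The shape is invented here for illustration (REAL's actual reward design is not reproduced): failing tests zero out the reward, and each static-analysis finding applies a penalty.

```python
# Hypothetical reward shaping with program-analysis feedback (not REAL's actual code).
import os
import subprocess
import tempfile


def reward(code: str, test_cmd: list[str], lint_cmd: list[str]) -> float:
    """Score a candidate program: functional failures dominate, lint findings penalize."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w") as f:
            f.write(code)
        # Functional correctness gate: reward is zero unless the tests pass.
        if subprocess.run(test_cmd + [path], capture_output=True).returncode != 0:
            return 0.0
        # Quality shaping: subtract a small penalty per analyzer finding.
        lint = subprocess.run(lint_cmd + [path], capture_output=True, text=True)
        findings = len(lint.stdout.splitlines())
        return max(0.0, 1.0 - 0.05 * findings)
```

The gate-then-penalize split mirrors the intuition in the summary: correctness is non-negotiable, while quality signals shape the remainder of the reward.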
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries [5.227446378450704]
APE-Bench I is the first realistic benchmark built from real-world commit histories of Mathlib4. Eleanstic is a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib.
arXiv Detail & Related papers (2025-04-27T05:04:02Z)
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge. Users pay for services based on advertised model capabilities, yet providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
- SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and the results show the task is challenging even for top-tier models.
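A toy rendering of that construction (names invented; this is not SOPBench's code): the SOP becomes a directed graph whose nodes are executable functions and whose edges are prerequisite constraints, and an agent's sequence of calls is replayed against the graph.

```python
# Toy SOP-as-directed-graph sketch (illustrative only; not SOPBench's code).
from typing import Callable

# Nodes: executable steps of a hypothetical "refund" procedure.
STEPS: dict[str, Callable[[], None]] = {
    "verify_identity": lambda: print("identity verified"),
    "check_eligibility": lambda: print("eligibility checked"),
    "issue_refund": lambda: print("refund issued"),
}

# Edges: each step's prerequisites, which must already have executed.
PREREQS: dict[str, set[str]] = {
    "check_eligibility": {"verify_identity"},
    "issue_refund": {"verify_identity", "check_eligibility"},
}


def run_agent_trace(trace: list[str]) -> bool:
    """Replay an agent's calls, rejecting any order-constraint violation."""
    done: set[str] = set()
    for step in trace:
        if not PREREQS.get(step, set()) <= done:
            return False  # a function was called before its prerequisites
        STEPS[step]()
        done.add(step)
    return True


assert run_agent_trace(["verify_identity", "check_eligibility", "issue_refund"])
assert not run_agent_trace(["issue_refund"])  # constraint violation detected
```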
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
- ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a procedure that equips LLMs with automation methods at various granularities. Our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. We also implement a Lean 4 version of ProofAug that improves the pass@1 performance of Kimina-Prover-Preview-Distill-1.5B from 44.3% to 50.4%.
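The granularity idea can be pictured with a small Lean 4 example (a generic illustration of coarse automated fallback, not ProofAug's implementation): when a model's fine-grained step sequence fails, the same goal may still close under a single coarser automated tactic.

```lean
example (a b c : Nat) : a + (b + c) = (c + b) + a := by
  -- A fine-grained attempt might chain rewrites such as `rw [Nat.add_comm b c]`,
  -- which often breaks partway through a model-written proof. A coarser fallback
  -- hands the whole goal to an automated decision procedure instead:
  omega
```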
arXiv Detail & Related papers (2025-01-30T12:37:06Z)
- The Potential of LLMs in Automating Software Testing: From Generation to Reporting [0.0]
Manual testing, while effective, can be time-consuming and costly, leading to an increased demand for automated methods. Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering. This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency.
arXiv Detail & Related papers (2024-12-31T02:06:46Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent driven by LLMs and based on the principle of a penetration testing state machine (PSM).
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- APITestGenie: Automated API Test Generation through Generative AI [2.0716352593701277]
APITestGenie generates executable API test scripts from business requirements and API specifications.
In experiments with 10 real-world APIs, the tool generated valid test scripts 57% of the time.
Human intervention is recommended to validate or refine generated scripts before integration into CI/CD pipelines.
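For a sense of what such a generated script looks like, here is a hypothetical example (the endpoint, fields, and assertions are invented, not APITestGenie output):

```python
# Hypothetical generated API test script (endpoint and payload fields invented).
import requests

BASE_URL = "https://api.example.com"


def test_create_order_returns_201_and_echoes_items():
    payload = {"customer_id": 42, "items": [{"sku": "A-100", "qty": 2}]}
    resp = requests.post(f"{BASE_URL}/orders", json=payload, timeout=10)
    assert resp.status_code == 201
    body = resp.json()
    assert body["customer_id"] == payload["customer_id"]
    assert len(body["items"]) == 1


def test_get_missing_order_returns_404():
    resp = requests.get(f"{BASE_URL}/orders/does-not-exist", timeout=10)
    assert resp.status_code == 404
```

Scripts like these are cheap to generate but, as the entry notes, still benefit from human review before landing in a CI/CD pipeline.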
arXiv Detail & Related papers (2024-09-05T18:02:41Z)
- Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean [81.94024084598598]
We introduce Lean Copilot, a framework for running LLM inference in Lean. We build tools that suggest proof steps, complete proof goals, and select relevant premises. When assisting humans, Lean Copilot requires only 2.08 manually-entered proof steps on average (versus 3.86 required by aesop). When automating the theorem proving process, Lean Copilot automates 74.2% of proof steps on average, 85% better than aesop (40.1%).
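In use, these tools are invoked as ordinary tactics inside a proof. The sketch below follows the tactic names documented in the Lean Copilot repository; the example goals themselves are invented:

```lean
import LeanCopilot

-- `suggest_tactics` asks the model for next-step candidates, shown in the infoview
-- for the user to click and insert.
example (a b : Nat) : a + b = b + a := by
  suggest_tactics

-- `search_proof` runs a model-guided search that tries to close the goal outright.
example (a b c : Nat) : a * (b + c) = a * b + a * c := by
  search_proof
```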
arXiv Detail & Related papers (2024-04-18T22:54:08Z)
- Automating REST API Postman Test Cases Using LLM [0.0]
This paper explores and implements an automated approach to generating test cases using Large Language Models.
The methodology integrates OpenAI models to improve the efficiency and effectiveness of test case generation.
The model developed during the research is trained on manually collected Postman test cases for various REST APIs.
arXiv Detail & Related papers (2024-04-16T15:53:41Z)
- Test-Driven Development for Code Generation [0.850206009406913]
Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements.
This paper investigates whether and how Test-Driven Development (TDD) can be incorporated into AI-assisted code-generation processes.
arXiv Detail & Related papers (2024-02-21T04:10:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.