The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee
- URL: http://arxiv.org/abs/2512.02080v1
- Date: Sun, 30 Nov 2025 22:19:09 GMT
- Title: The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee
- Authors: Pierre Dantas, Lucas Cordeiro, Youcheng Sun, Waldir Junior
- Abstract summary: This work bridges the gap by developing an LLM-Verifier Convergence Theorem. We model the interaction between the LLM and the verifier as a discrete-time Markov Chain. We stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials.
- Score: 5.345468714252351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The idea of using Formal Verification tools with large language models (LLMs) has enabled scaling software verification beyond manual workflows. However, current methods remain unreliable. Without a solid theoretical footing, the refinement process can wander; sometimes it settles, sometimes it loops back, and sometimes it breaks away from any stable trajectory. This work bridges this critical gap by developing an LLM-Verifier Convergence Theorem, providing the first formal framework with provable guarantees for termination and convergence. We model the interaction between the LLM and the verifier as a discrete-time Markov Chain, with state transitions determined by a key parameter: the error-reduction probability ($δ$). Because the procedure reaches the Verified state almost surely, the program terminates for any $δ > 0$, with an expected iteration count bounded by $\mathbb{E}[n] \leq 4/δ$. We then stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials. The empirical results match the theory with striking consistency: every single run reached verification, and the convergence factor clustered tightly around $C_f \approx 1.0$. The bound therefore mirrors the system's actual behavior. The evidence is sufficiently robust to support dividing the workflow into three distinct operating zones: marginal, practical, and high-performance. Consequently, we can establish design thresholds with confidence. Together, the theoretical guarantee and the experimental evidence provide a clearer architectural foundation for LLM-assisted verification. Heuristic tuning of the system is no longer required. Engineers gain a framework that supports predictable resource planning and performance budgeting, precisely what is needed before deploying these pipelines into safety-critical software environments.
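The $\mathbb{E}[n] \leq 4/δ$ bound can be illustrated with a minimal simulation. The sketch below assumes the simplest possible transition structure consistent with the abstract: each refinement round independently reaches the Verified state with probability $δ$, so the round count is geometric. The paper's actual Markov-chain model may be richer; this is an assumption for illustration only.

```python
# Minimal simulation of the LLM-verifier refinement loop described above.
# Assumption: each round succeeds independently with probability delta
# (a geometric round count); the paper's transition structure may differ.
import random


def refinement_iterations(delta: float, rng: random.Random) -> int:
    """Number of refinement rounds until the Verified state is reached."""
    n = 0
    while True:
        n += 1
        if rng.random() < delta:  # error-reduction event succeeds this round
            return n


def mean_iterations(delta: float, trials: int = 10_000, seed: int = 0) -> float:
    """Empirical E[n] over many independent trials."""
    rng = random.Random(seed)
    return sum(refinement_iterations(delta, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for delta in (0.1, 0.25, 0.5):
        emp = mean_iterations(delta)
        print(f"delta={delta}: empirical E[n]={emp:.2f}, bound 4/delta={4 / delta:.1f}")
```

Under this geometric assumption the empirical mean sits near $1/δ$, comfortably inside the $4/δ$ bound, consistent with the reported convergence factor clustering near 1.0.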
Related papers
- Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
arXiv Detail & Related papers (2026-02-10T21:08:53Z) - Veri-Sure: A Contract-Aware Multi-Agent Framework with Temporal Tracing and Formal Verification for Correct RTL Code Generation [4.723302382132762]
silicon-grade correctness remains bottlenecked by: (i) limited test coverage and reliability of simulation-centric evaluation, (ii) regressions and repair hallucinations, and (iii) semantic drift as intent is reinterpreted across agent handoffs. We propose Veri-Sure, a multi-agent framework that establishes a design contract to align agents' intent and uses a patching mechanism guided by static dependency slicing to perform precise, localized repairs.
arXiv Detail & Related papers (2026-01-27T16:10:23Z) - Managing the Stochastic: Foundations of Learning in Neuro-Symbolic Systems for Software Engineering [0.27195102129094995]
Current approaches to AI coding agents blur the lines between the Large Language Model and the agent itself. This paper proposes setting the control boundary such that the LLM is treated as a component of the environment.
arXiv Detail & Related papers (2025-12-18T15:28:21Z) - Uncertainty-Guided Expert-AI Collaboration for Efficient Soil Horizon Annotation [0.13999481573773068]
We apply conformal prediction to SoilNet, a multimodal multitask model for describing soil profiles. We design a simulated human-in-the-loop (HIL) annotation pipeline, where a limited budget for obtaining ground-truth annotations is available when model uncertainty is high. Experiments show that conformalizing SoilNet leads to more efficient annotation in regression tasks and comparable performance scores in classification tasks.
arXiv Detail & Related papers (2025-09-29T14:54:23Z) - A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA [65.38186593873313]
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. We introduce a proof-of-concept multi-call framework for MHQA, InfoQA. We construct a stringent and noise-rich benchmark to validate our theory and framework.
arXiv Detail & Related papers (2025-09-25T14:11:57Z) - Tractable Asymmetric Verification for Large Language Models via Deterministic Replicability [0.6117371161379209]
The landscape of Large Language Models (LLMs) shifts rapidly towards dynamic, multi-agent systems. This paper proposes a verification framework that achieves tractable asymmetric effort. We show that targeted verification can be over 12 times faster than full regeneration.
arXiv Detail & Related papers (2025-09-14T03:30:06Z) - LLMs are Bayesian, in Expectation, not in Realization [0.0]
Large language models adapt to new tasks without parameter updates. Recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property. This violation challenges the theoretical foundations underlying uncertainty quantification in critical applications.
arXiv Detail & Related papers (2025-07-15T22:20:11Z) - Probabilistically Tightened Linear Relaxation-based Perturbation Analysis for Neural Network Verification [83.25968588249776]
We present a novel framework that combines over-approximation techniques from LiRPA-based approaches with a sampling-based method to compute tight intermediate reachable sets. With negligible computational overhead, PT-LiRPA, exploiting the estimated reachable sets, significantly tightens the lower and upper linear bounds of a neural network's output.
arXiv Detail & Related papers (2025-07-07T18:45:53Z) - Towards Automated Formal Verification of Backend Systems with LLMs [9.66648456498893]
We propose a novel framework that leverages functional programming and type systems to translate backend code into formal Lean representations. Our pipeline automatically generates theorems that specify the intended behavior of APIs and database operations, and uses LLM-based provers to verify them. We evaluate our method on realistic backend systems and find that it can formally verify over 50% of the test requirements, which suggests that half of a testing engineer's workload can be automated.
arXiv Detail & Related papers (2025-04-13T16:49:37Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [71.7892165868749]
Commercial Large Language Model (LLM) APIs create a fundamental trust problem: users pay for specific models but have no guarantee that providers deliver them faithfully. We formalize this model substitution problem and evaluate detection methods under realistic adversarial conditions. We propose and evaluate the use of Trusted Execution Environments (TEEs) as one practical and robust solution.
arXiv Detail & Related papers (2025-04-07T03:57:41Z) - Rethinking Uncertainty Estimation in Natural Language Generation [6.3398383724486544]
Large Language Models (LLMs) are increasingly employed in real-world applications. Uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. We propose G-NLL, which has the advantage of being obtained using only a single output sequence.
arXiv Detail & Related papers (2024-12-19T18:51:06Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.