Towards Verified Code Reasoning by LLMs
- URL: http://arxiv.org/abs/2509.26546v1
- Date: Tue, 30 Sep 2025 17:17:51 GMT
- Title: Towards Verified Code Reasoning by LLMs
- Authors: Meghana Sistla, Gogul Balakrishnan, Pat Rondon, José Cambronero, Michele Tufano, Satish Chandra
- Abstract summary: We describe a method to automatically validate the answers provided by a code reasoning agent. The method consists of extracting a formal representation of the agent's response and, subsequently, using formal verification and program analysis tools.
- Score: 6.973151264926856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While LLM-based agents are able to tackle a wide variety of code reasoning questions, the answers are not always correct. This prevents the agent from being useful in situations where high precision is desired: (1) helping a software engineer understand a new code base, (2) helping a software engineer during code review sessions, and (3) ensuring that the code generated by an automated code generation system meets certain requirements (e.g. fixes a bug, improves readability, implements a feature). As a result of this lack of trustworthiness, the agent's answers need to be manually verified before they can be trusted. Manually confirming responses from a code reasoning agent requires human effort and can result in slower developer productivity, which weakens the assistance benefits of the agent. In this paper, we describe a method to automatically validate the answers provided by a code reasoning agent by verifying its reasoning steps. At a very high level, the method consists of extracting a formal representation of the agent's response and, subsequently, using formal verification and program analysis tools to verify the agent's reasoning steps. We applied this approach to a benchmark set of 20 uninitialized variable errors detected by sanitizers and 20 program equivalence queries. For the uninitialized variable errors, the formal verification step was able to validate the agent's reasoning on 13/20 examples, and for the program equivalence queries, the formal verification step successfully caught 6/8 incorrect judgments made by the agent.
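The general idea in the abstract (turn an agent's answer into a formal claim, then check that claim with a program analysis) can be illustrated with a toy sketch. This is not the paper's pipeline or tooling: the `Stmt` and `Claim` types, the straight-line program model, and the `validate` helper are all hypothetical, chosen only to show what validating an "uninitialized variable" claim might look like in the simplest possible setting.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Set, Tuple


@dataclass(frozen=True)
class Stmt:
    """One straight-line statement: writes `target` after reading `reads`."""
    target: str
    reads: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Claim:
    """Hypothetical formalized agent claim: `var` is read uninitialized at `index`."""
    var: str
    index: int


def uninitialized_reads(
    program: Sequence[Stmt], inputs: Iterable[str] = ()
) -> Set[Tuple[int, str]]:
    """Return (index, var) pairs where var is read before any earlier write."""
    defined = set(inputs)  # names assumed to be initialized on entry
    found = set()
    for i, stmt in enumerate(program):
        for v in stmt.reads:
            if v not in defined:
                found.add((i, v))
        defined.add(stmt.target)  # the write happens after the reads
    return found


def validate(claim: Claim, program: Sequence[Stmt], inputs: Iterable[str] = ()) -> bool:
    """Accept the claim only if the analysis independently finds the same read."""
    return (claim.index, claim.var) in uninitialized_reads(program, inputs)


# Toy program (assumed for illustration):
#   0: a = f(n)
#   1: b = a + tmp     <- tmp has not been written yet
#   2: tmp = b
program = [
    Stmt("a", ("n",)),
    Stmt("b", ("a", "tmp")),
    Stmt("tmp", ("b",)),
]

print(validate(Claim("tmp", 1), program, inputs=("n",)))  # True: claim confirmed
print(validate(Claim("a", 1), program, inputs=("n",)))    # False: claim rejected
```

The point of the sketch is the separation of roles: the agent's answer is reduced to a small checkable claim, and acceptance depends on an independent analysis rather than on the agent's own explanation. Real code has branches, loops, and pointers, which is where actual verification and program analysis tools would be needed.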
Related papers
- Agentified Assessment of Logical Reasoning Agents [3.5548629490839594]
Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).
arXiv Detail & Related papers (2026-03-03T09:26:08Z)
- Agentic Code Reasoning [6.246212222645163]
We introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. We evaluate three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution.
arXiv Detail & Related papers (2026-03-02T14:17:06Z)
- Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
- GSM-Agent: Understanding Agentic Reasoning Using Controllable Environments [56.007498767771075]
GSM-Agent is a novel benchmark for evaluating agentic reasoning in complex environments. We analyze the agentic reasoning patterns by clustering the environment's document embeddings into nodes and mapping each tool call to its nearest node. We propose a tool-augmented test-time scaling method to improve LLMs' agentic reasoning performance by adding tools to encourage models to revisit.
arXiv Detail & Related papers (2025-09-26T07:24:37Z)
- VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
- AutoCodeSherpa: Symbolic Explanations in AI Coding Agents [10.706082274730734]
Large Language Model (LLM) agents autonomously use external tools on top of one or more LLMs to accomplish specific tasks. Lately LLM agents for software engineering tasks have become popular. This is demonstrated by existing agentic AI solutions such as AutoCodeRover or SpecRover which perform automated program repair.
arXiv Detail & Related papers (2025-07-30T06:34:02Z)
- Do AI models help produce verified bug fixes? [62.985237003585674]
Large Language Models are used to produce corrections to software bugs. This paper investigates how programmers use Large Language Models to complement their own skills. The results are a first step towards a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs.
arXiv Detail & Related papers (2025-07-21T17:30:16Z)
- Towards Robust Fact-Checking: A Multi-Agent System with Advanced Evidence Retrieval [1.515687944002438]
The rapid spread of misinformation in the digital era poses significant challenges to public discourse. Traditional human-led fact-checking methods, while credible, struggle with the volume and velocity of online content. This paper proposes a novel multi-agent system for automated fact-checking that enhances accuracy, efficiency, and explainability.
arXiv Detail & Related papers (2025-06-22T02:39:27Z)
- Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- CodeAgent: Autonomous Communicative Agents for Code Review [12.163258651539236]
This work introduces CodeAgent, a novel multi-agent Large Language Model (LLM) system for code review automation.
CodeAgent incorporates a supervisory agent, QA-Checker, to ensure that all the agents' contributions address the initial review question.
Results demonstrate CodeAgent's effectiveness, contributing to a new state-of-the-art in code review automation.
arXiv Detail & Related papers (2024-02-03T14:43:14Z)
- GRACE: Discriminator-Guided Chain-of-Thought Reasoning [75.35436025709049]
We propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE) to steer the decoding process towards producing correct reasoning steps.
GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates.
arXiv Detail & Related papers (2023-05-24T09:16:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.