Related papers: Agentic Code Reasoning

Agentic Code Reasoning

URL: http://arxiv.org/abs/2603.01896v2
Date: Wed, 04 Mar 2026 01:17:10 GMT
Title: Agentic Code Reasoning
Authors: Shubham Ugare, Satish Chandra,
Abstract summary: We introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions.<n>We evaluate three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy.<n>These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution.
Score: 6.246212222645163
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals. For code question answering on RubberDuckBench Mohammad et al. (2026), semi-formal reasoning achieves 87% accuracy. For fault localization on Defects4J Just et al. (2014), semi-formal reasoning improves Top-5 accuracy by 5 percentage points over standard reasoning. These results demonstrate that structured agentic reasoning enables meaningful semantic code analysis without execution, opening practical applications in RL training pipelines, code review, and static program analysis.

Related papers

Agentified Assessment of Logical Reasoning Agents [3.5548629490839594]
Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types.<n>As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO.<n>The auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%)
arXiv Detail & Related papers (2026-03-03T09:26:08Z)
Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy.<n>We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z)
RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models [5.733004743054914]
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process.<n>We introduce a formal framework for reasoning faithfulness, defined by two testable conditions.<n>We present RFEval, a benchmark of 7,186 instances that probes faithfulness via controlled, output-level counterfactual interventions.
arXiv Detail & Related papers (2026-02-19T03:49:37Z)
Training LLMs with LogicReward for Faithful and Rigorous Reasoning [75.30425553246177]
We propose LogicReward, a reward system that guides model training by enforcing step-level logical correctness with a theorem prover.<n>An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks.
arXiv Detail & Related papers (2025-12-20T03:43:02Z)
Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation [7.377446354867118]
We conduct the first empirical study on runtime behavior inference with large language models (LLMs)<n>We evaluate four state-of-the-art reasoning LLMs and develop a taxonomy with nine categories of inference errors.<n>Using failures in the Computation category as a case study, our experiments show that this approach corrects 58 percent of such errors.
arXiv Detail & Related papers (2025-11-28T21:29:09Z)
HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs [32.234133057592935]
Hermes is a tool-assisted agent that interleaves informal reasoning with verified proof steps in Lean systems.<n>We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales.
arXiv Detail & Related papers (2025-11-24T04:50:18Z)
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization [73.0780809974414]
We propose a Reflective Autoformalization method that integrates semantic consistency evaluation into the autoformalization process.<n>This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors.<n>Experiments show that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines.
arXiv Detail & Related papers (2025-10-28T16:22:54Z)
Towards Verified Code Reasoning by LLMs [6.973151264926856]
We describe a method to automatically validate the answers provided by a code reasoning agent.<n>The method consists of extracting a formal representation of the agent's response and, subsequently, using formal verification and program analysis tools.
arXiv Detail & Related papers (2025-09-30T17:17:51Z)
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation.<n>It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective.<n>On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
Neuro-Symbolic Integration Brings Causal and Reliable Reasoning Proofs [95.07757789781213]
Two lines of approaches are adopted for complex reasoning with LLMs.<n>One line of work prompts LLMs with various reasoning structures, while the structural outputs can be naturally regarded as intermediate reasoning steps.<n>The other line of work adopt LLM-free declarative solvers to do the reasoning task, rendering higher reasoning accuracy but lacking interpretability due to the black-box nature of the solvers.<n>We present a simple extension to the latter line of work. Specifically, we showcase that the intermediate search logs generated by Prolog interpreters can be accessed and interpreted into human-readable reasoning.
arXiv Detail & Related papers (2023-11-16T11:26:21Z)
PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs. Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm. We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.