Related papers: Generating Verifiable CoT from Execution-Traces

URL: http://arxiv.org/abs/2512.00127v1
Date: Fri, 28 Nov 2025 07:43:43 GMT
Title: Generating Verifiable CoT from Execution-Traces
Authors: Shailja Thakur, Vaibhav Saxena, Rohan Kulkarni, Shivdeep Singh, Parameswaran Selvam, Hima Patel, Hiroshi Kanayama
Abstract summary: Chain-of-Thought prompting has shown promise, but current synthetic training data suffers from a critical weakness. We address this by grounding CoT generation directly in program execution traces. This execution-grounded approach ensures every reasoning step reflects what the program genuinely computes.
Score: 6.634229408414094
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Teaching language models to reason about code execution remains a fundamental challenge. While Chain-of-Thought (CoT) prompting has shown promise, current synthetic training data suffers from a critical weakness: the reasoning steps are often plausible-sounding explanations generated by teacher models, not verifiable accounts of what the code actually does. This creates a troubling failure mode where models learn to mimic superficially convincing but logically flawed reasoning patterns. We address this by grounding CoT generation directly in program execution traces. Our pipeline instruments code to capture its dynamic behavior, then narrates these verified execution traces into natural language rationales that are correct by construction. This execution-grounded approach ensures every reasoning step reflects what the program genuinely computes, eliminating logical hallucinations at the source. We evaluate our method on code reasoning tasks (forward reasoning on CruxEval and LiveCodeBench-Exec, backward reasoning on CruxEval-Input), as well as code generation and explanation tasks from HumanEval. Models trained on our bi-directional trace-grounded data achieve substantial improvements, with gains of up to 30 points on output prediction and 28 points on input prediction over base models, alongside improved explanation and code generation, demonstrating that verifiable reasoning fundamentally enhances model capabilities. https://github.ibm.com/IBM-Research-AI/Verified-Code-CoT
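The instrument-then-narrate idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes Python's standard `sys.settrace` hook as the instrumentation mechanism, and the helper names `trace_execution`, `narrate`, and `running_sum` are hypothetical, introduced only for illustration.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args) and record one step per executed line of func.

    Each step is ("line", {name: repr(value)}) holding the local variables
    as they are just before that line runs, or ("return", repr(value)).
    Hypothetical helper, not taken from the paper.
    """
    steps = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # do not trace calls into other functions
        if event == "line":
            # Store reprs so that later mutation cannot alter the snapshot.
            snapshot = {k: repr(v) for k, v in frame.f_locals.items()}
            steps.append(("line", snapshot))
        elif event == "return":
            steps.append(("return", repr(arg)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps


def narrate(steps):
    """Turn recorded steps into sentences, one per observed state change."""
    sentences = []
    seen = {}
    for kind, payload in steps:
        if kind == "return":
            sentences.append(f"The function returns {payload}.")
            continue
        for name, value in payload.items():
            if seen.get(name) != value:
                sentences.append(f"{name} is now {value}.")
        seen = payload
    return sentences


def running_sum(xs):
    # Toy program to trace (an assumption for the demo).
    total = 0
    for x in xs:
        total += x
    return total


if __name__ == "__main__":
    _, recorded = trace_execution(running_sum, [1, 2])
    for sentence in narrate(recorded):
        print(sentence)
```

Because every sentence is derived from a recorded variable state rather than written by a teacher model, the narration cannot disagree with what the program computed, which is the property the abstract calls correct by construction.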

Related papers

Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training [2.62112541805429]
Reasoning Core is a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality.
arXiv Detail & Related papers (2026-03-02T18:59:29Z)
LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation [86.08600027874662]
We propose LogitsCoder, a novel framework that enhances chain-of-thought reasoning through lightweight, logit-level control mechanisms for code generation. We show that LogitsCoder produces more efficient and higher-quality reasoning chains, leading to superior code generation performance compared to baseline methods.
arXiv Detail & Related papers (2026-02-15T08:52:19Z)
Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation [7.377446354867118]
We conduct the first empirical study on runtime behavior inference with large language models (LLMs). We evaluate four state-of-the-art reasoning LLMs and develop a taxonomy with nine categories of inference errors. Using failures in the Computation category as a case study, our experiments show that this approach corrects 58 percent of such errors.
arXiv Detail & Related papers (2025-11-28T21:29:09Z)
Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking [54.43083499412643]
Test-time algorithms that combine the generative power of language models with process verifiers offer a promising lever for eliciting new reasoning capabilities. We introduce a new process-guided test-time sampling algorithm, VGB, which uses theoretically grounded backtracking to achieve provably better robustness to verifier errors.
arXiv Detail & Related papers (2025-10-03T16:21:14Z)
On Explaining (Large) Language Models For Code Using Global Code-Based Explanations [45.126233498200534]
Language Models for Code (LLM4Code) have significantly changed the landscape of software engineering (SE). We introduce code rationales (Code$Q$), a technique with rigorous mathematical underpinning, to identify subsets of tokens that can explain individual code predictions. Our evaluation demonstrates that Code$Q$ is a powerful interpretability method to explain how (less) meaningful input concepts (i.e., the natural language particle 'at') highly impact output generation.
arXiv Detail & Related papers (2025-03-21T01:00:45Z)
Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences [38.76458756232632]
We study inductive reasoning in large language models. We use number sequences as the source of inductive reasoning data. We build a sequence synthetic data pipeline and form a training dataset CodeSeq.
arXiv Detail & Related papers (2025-03-17T12:33:26Z)
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction [47.17755403213469]
We propose CodeI/O, a novel approach that condenses diverse reasoning patterns embedded in contextually-grounded codes. By training models to predict inputs/outputs given code and test cases entirely in natural language, we expose them to universal reasoning primitives. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks.
arXiv Detail & Related papers (2025-02-11T07:26:50Z)
NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs. We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z)
Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.