Related papers: Chain of Execution Supervision Promotes General Reasoning in Large Language Models

URL: http://arxiv.org/abs/2510.23629v1
Date: Fri, 24 Oct 2025 02:21:11 GMT
Title: Chain of Execution Supervision Promotes General Reasoning in Large Language Models
Authors: Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu,
Abstract summary: We introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales. We evaluate TracePile using three training setups: continue-pretraining, instruction tuning after pretraining, and two-stage finetuning. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU.
Score: 48.100128916029064
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continue-pretraining, instruction tuning after pretraining, and two-stage finetuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
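
As a rough illustration of what turning code execution into step-by-step rationales can look like, the sketch below records a function's local variables as it runs and prints each change as a numbered step. It is not the TracePile pipeline, whose construction the abstract does not detail; the helper names `trace_execution` and `render_chain`, the `gcd` example, and the step wording are all assumptions made for illustration, built on Python's standard `sys.settrace` hook.

```python
import sys

_MISSING = object()  # sentinel for "variable did not exist yet"


def trace_execution(func, *args):
    """Run func(*args) and snapshot its locals at every line and at return.

    Hypothetical helper: each snapshot is (line offset inside func, locals).
    Locals are copied shallowly, so mutable values may alias later states.
    """
    snapshots = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames other than the traced function
        if event in ("line", "return"):
            offset = frame.f_lineno - code.co_firstlineno
            snapshots.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, snapshots


def render_chain(snapshots):
    """Describe every variable change between consecutive snapshots."""
    rationale = []
    for (line, before), (_, after) in zip(snapshots, snapshots[1:]):
        changes = {
            name: value
            for name, value in after.items()
            if before.get(name, _MISSING) != value
        }
        if changes:
            updates = ", ".join(f"{n} = {v!r}" for n, v in changes.items())
            rationale.append(
                f"Step {len(rationale) + 1}: line +{line} sets {updates}"
            )
    return rationale


def gcd(a, b):
    # Euclid's algorithm, used here only as a small program to trace.
    while b:
        a, b = b, a % b
    return a


result, snapshots = trace_execution(gcd, 12, 18)
rationale = render_chain(snapshots)
for step in rationale:
    print(step)
print(f"Result: {result}")
```

A variable-tracing question in the spirit of the abstract could then be built from the same snapshots, for example by asking for the value of `b` after a given step and taking the answer from the recorded state.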

Related papers

Readability-Robust Code Summarization via Meta Curriculum Learning [53.44612630063336]
In the real world, code is often poorly structured or obfuscated, significantly degrading model performance. We propose RoFTCodeSum, a novel fine-tuning method that enhances the robustness of code summarization against poorly readable code.
arXiv Detail & Related papers (2026-01-09T02:38:24Z)
BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs. We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z)
Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs [12.70863561286374]
We introduce a lifecycle-aware framework that incorporates intermediate artifacts into both the training and inference stages. Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning. Open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code.
arXiv Detail & Related papers (2025-10-28T02:54:02Z)
Code-enabled language models can outperform reasoning models on diverse tasks [86.29363856881399]
We show that standard instruct LMs can already be elicited to be strong reasoners without finetuning. This is achieved by CodeAdapt, where LMs interleave natural language reasoning with code execution in a multi-step fashion. We find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks.
arXiv Detail & Related papers (2025-10-23T18:04:03Z)
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment [98.87395842351627]
Large Language Models (LLMs) excel at code generation by learning from vast code corpora. A fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness. We propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation.
arXiv Detail & Related papers (2025-10-21T09:48:06Z)
CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning [8.197518276987989]
Code reasoning is a fundamental capability for large language models (LLMs) in the code domain. Prior approaches mainly rely on supervised fine-tuning to improve performance in code reasoning tasks. We argue this is due to two core issues: the low quality of training data and the limitations of supervised fine-tuning. We propose CodeReasoner, a framework that spans both dataset construction and a two-stage training process.
arXiv Detail & Related papers (2025-07-23T14:26:58Z)
Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
From Reasoning to Code: GRPO Optimization for Underrepresented Languages [0.7864304771129751]
This paper introduces a generalizable approach that uses small-scale code versions of the Qwen 2.5 model combined with Group Relative Policy Optimization. It produces logically consistent and syntactically accurate code by directly integrating reasoning-driven feedback into the reinforcement learning loop.
arXiv Detail & Related papers (2025-05-20T11:28:48Z)
EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs). This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
CodeMind: Evaluating Large Language Models for Code Reasoning [6.819757372634151]
Large Language Models (LLMs) have been widely used to automate programming tasks. This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-02-15T02:24:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.