R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
- URL: http://arxiv.org/abs/2505.21668v1
- Date: Tue, 27 May 2025 18:47:33 GMT
- Title: R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
- Authors: Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Chuchu Fan
- Abstract summary: We present R1-Code-Interpreter, an extension of a text-only Large Language Model (LLM) trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL). R1-Code-Interpreter autonomously generates multiple code queries during step-by-step reasoning. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution.
- Score: 14.208804782749793
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Despite advances in the reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with emergent self-checking behavior via code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
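The core mechanism the abstract describes is a multi-turn loop in which the model interleaves textual reasoning with code that it writes and executes, feeding the execution output back into its context. Below is a minimal, illustrative sketch of such a loop; the tag format, the canned stand-in model, and the subprocess sandbox are assumptions for illustration, not the released R1-Code-Interpreter implementation (see the linked GitHub repository for that).

```python
import re
import subprocess
import sys

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a generated code snippet in a subprocess and return its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr

def interpreter_loop(llm_step, question: str, max_turns: int = 5) -> str:
    """llm_step(context) returns the next reasoning chunk, which may contain a
    <code>...</code> block to execute or a final 'Answer: ...' line (assumed format)."""
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        chunk = llm_step(context)
        context += chunk
        code = re.search(r"<code>(.*?)</code>", chunk, re.DOTALL)
        if code:  # feed the execution result back for the next reasoning turn
            context += f"\nExecution output:\n{run_python(code.group(1))}\n"
        elif "Answer:" in chunk:
            return chunk.split("Answer:", 1)[1].strip()
    return context

# Toy usage: a canned two-turn "model" stands in for the trained LLM.
steps = iter([
    "I will compute this with code.\n<code>\nprint(sum(range(1, 101)))\n</code>\n",
    "The execution output gives the result.\nAnswer: 5050",
])
print(interpreter_loop(lambda ctx: next(steps), "Sum the integers 1..100."))
```

During RL training, the abstract's "masked vs. unmasked code outputs" refers to whether the interpreter's execution output tokens are excluded from the training loss; the sketch above only covers inference-time behavior.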
Related papers
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning [8.571111167616165]
Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection. We propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet.
arXiv Detail & Related papers (2025-06-06T13:23:37Z)
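A rough sketch of the general idea behind the conditional scoring described above: compare how predictable a code snippet is when the scoring model is conditioned on an approximated description of its task versus a generic prefix. The model choice (GPT-2, purely so the example runs anywhere), the prompt wording, and the simple NLL-difference score are assumptions for illustration, not the paper's exact detector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model; a stronger code LM would normally be used for scoring.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_nll(prefix: str, code: str) -> float:
    """Average negative log-likelihood of `code` tokens given `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    code_ids = tok(code, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, code_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the code tokens
    with torch.no_grad():
        return model(input_ids=input_ids, labels=labels).loss.item()

code_snippet = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
task_guess = "# Task: write a recursive Fibonacci function in Python\n"
generic = "# Python code\n"

# If conditioning on the inferred task makes the snippet much easier to predict,
# that is one (hedged) signal of machine generation.
score = mean_nll(generic, code_snippet) - mean_nll(task_guess, code_snippet)
print(f"detection score: {score:.3f}")
```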
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [50.419872452397684]
Search-R1 is an extension of reinforcement-learning-for-reasoning frameworks. It generates search queries during step-by-step reasoning with real-time retrieval. It improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines.
arXiv Detail & Related papers (2025-03-12T16:26:39Z)
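For contrast with the code-interpreter loop sketched earlier, retrieval-augmented reasoning of this kind can be pictured as a loop that pauses generation whenever the model emits a search query and splices the retrieved text back into the context. The tag names, the toy corpus, and the canned stand-in model below are assumptions for illustration, not Search-R1's actual interface.

```python
import re

# Toy in-memory "index" standing in for a real-time search engine.
corpus = {"tallest mountain": "Mount Everest stands at 8,849 m."}

def retrieve(query: str) -> str:
    return next((doc for key, doc in corpus.items() if key in query.lower()),
                "No results found.")

def search_loop(llm_step, question: str, max_turns: int = 4) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        chunk = llm_step(context)
        context += chunk
        query = re.search(r"<search>(.*?)</search>", chunk, re.DOTALL)
        if query is None:          # no query issued: treat the chunk as final
            return chunk.strip()
        context += f"<information>{retrieve(query.group(1))}</information>\n"
    return context

# Canned two-turn "model" for demonstration.
steps = iter([
    "I need a fact first.\n<search>height of the tallest mountain</search>\n",
    "Answer: Mount Everest, about 8,849 m.",
])
print(search_loop(lambda ctx: next(steps), "How tall is the tallest mountain?"))
```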
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Steering Large Language Models between Code Execution and Textual Reasoning [22.279107036500083]
Textual reasoning has inherent limitations in solving tasks that involve math, logic, optimization, and search. OpenAI's GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency in integrating code generation and execution. We propose three methods to better steer LLM code/text generation and achieve a notable improvement.
arXiv Detail & Related papers (2024-10-04T15:44:47Z)
- Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models [6.646510073473929]
We propose SlimCode, a model-agnostic code simplification solution for Large Language Models.
SlimCode can improve the state-of-the-art technique by 9.46% and 5.15% in terms of MRR and BLEU score on code search and summarization, respectively.
arXiv Detail & Related papers (2024-05-18T06:15:52Z)
- Is Next Token Prediction Sufficient for GPT? Exploration on Code Logic Comprehension [18.919972400933393]
We propose an advanced pretraining task, "Next Token Prediction+".
Following this pretraining, both Code Llama and StarCoder, prevalent code-domain pretrained models, display significant improvements on our logically equivalent code selection task and the code completion task.
arXiv Detail & Related papers (2024-04-13T03:11:07Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting yields a large performance boost for multiple LLMs.
Our analysis of GPT-3.5 reveals that the code formatting of the input problem is essential for the performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
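To make the idea concrete, here is a hedged illustration of turning a small conditional-reasoning problem into a code-formatted prompt; the problem, the variable names, and the template are invented for illustration and are not the paper's exact prompt format.

```python
natural_language = (
    "A permit can be issued if the applicant is over 18 and is a resident. "
    "Alex is 17 and a resident. Can Alex get the permit?"
)

# The same problem rewritten as code: facts become variables, the rule becomes
# an expression, and the question becomes a trailing comment for the LLM.
code_prompt = """\
# Facts from the problem statement
age = 17
is_resident = True

# Rule: a permit can be issued if the applicant is over 18 and a resident
permit_issued = age > 18 and is_resident

# Question: can Alex get the permit? Reason about the value of permit_issued.
"""

# The code-formatted prompt is then sent to a text+code LLM in place of the
# prose version, e.g. answer = llm.generate(code_prompt)  (hypothetical client).
print(code_prompt)
```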
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward.
We introduce a new RL formulation for text generation from the soft Q-learning perspective.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
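For context, a standard (not paper-specific) soft Q-learning backup over the token vocabulary, in which per-token logits play the role of Q-values, can be written as

$$
V(s_t) \;=\; \tau \log \sum_{a \in \mathcal{A}} \exp\!\left(\frac{Q(s_t, a)}{\tau}\right),
\qquad
Q(s_t, a_t) \;\leftarrow\; r(s_t, a_t) + \gamma\, V(s_{t+1}),
$$

where $\mathcal{A}$ is the vocabulary, $\tau$ is a temperature, and $r$ is the task-metric reward plugged in by the user; the paper's efficient formulation builds on this view but differs in its details.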
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.