TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
- URL: http://arxiv.org/abs/2602.15449v1
- Date: Tue, 17 Feb 2026 09:29:18 GMT
- Title: TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
- Authors: Chansung Park, Juyong Jiang, Fan Wang, Sayak Paul, Jiasi Shen, Jing Tang, Jianguo Li,
- Abstract summary: Large Language Models (LLMs) are changing the coding paradigm, yet synthesizingically sophisticated and robust code still remains a critical challenge.<n>We propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fineTuning (TAROT) to address this need.
- Score: 26.385183692191873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
Related papers
- ATGen: Adversarial Reinforcement Learning for Test Case Generation [78.48498301767079]
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs.<n>Existing test generation methods rely on static datasets.<n>We introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning.
arXiv Detail & Related papers (2025-10-16T12:49:25Z) - Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models [44.17697803306198]
We introduce textitCodeSeq, a synthetic post-training dataset built from number sequences.<n>Our pipeline generates supervised fine data by reflecting on failed test cases and incorporating iterative corrections.<n> Experimental results show that the models trained with textitCodeSeq improve on various reasoning tasks and can preserve the models' OOD performance.
arXiv Detail & Related papers (2025-10-16T12:29:40Z) - Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning [65.20602712957725]
Caco is a novel framework that automates the synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning data.<n>Our work establishes a paradigm for building self-sustaining, trustworthy reasoning systems without human intervention.
arXiv Detail & Related papers (2025-10-05T07:59:24Z) - Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs)<n>Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability.<n>We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z) - CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement [12.792149709662874]
CodeGrad introduces a principled framework that integrates rigorous verification techniques directly into an iterative generation loop.<n>It treats code as a differentiable variable, converting structured feedback and mathematical constraints into a textual pseudo-gradient.<n>We evaluate CodeGrad on the HumanEval, HumanEval+, and LiveCodeBench benchmarks.
arXiv Detail & Related papers (2025-08-12T22:03:54Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.<n>However, improvement is plateauing due to the exhaustion of readily available high-quality data.<n>We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - Should Code Models Learn Pedagogically? A Preliminary Evaluation of Curriculum Learning for Real-World Software Engineering Tasks [2.0072624123275533]
Recent work shows that Curriculum Learning can improve performance on code-related tasks through incremental learning based on the difficulty of synthetic code.<n>We investigate how the pre-trained code model (CodeT5) learns under CL, through the tasks of code clone detection and code summarization.<n>Our empirical study on the CodeXGLUE benchmark showed contrasting results to prior studies, where the model exhibited signs of catastrophic forgetting and shortcut learning.
arXiv Detail & Related papers (2025-02-06T06:33:08Z) - Model Predictive Task Sampling for Efficient and Robust Adaptation [57.414812940406996]
We introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk distributions.<n>MPTS employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference.<n>MPTS seamlessly integrates into zero-shot, few-shot, and supervised finetuning settings.
arXiv Detail & Related papers (2025-01-19T13:14:53Z) - SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [14.786100203787194]
Large language models demonstrate exceptional performance in simple code generation tasks but face challenges in tackling complex problems.<n>We propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths.<n>Our method operates entirely through the model itself without requiring additional supervision.
arXiv Detail & Related papers (2024-11-17T12:31:04Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.