Self-Evolving Curriculum for LLM Reasoning
- URL: http://arxiv.org/abs/2505.14970v2
- Date: Thu, 29 May 2025 22:28:38 GMT
- Title: Self-Evolving Curriculum for LLM Reasoning
- Authors: Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, Ehsan Kamalloo
- Abstract summary: Self-Evolving Curriculum (SEC) is an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems.
- Score: 108.23021254812258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
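The abstract frames curriculum selection as a non-stationary multi-armed bandit over problem categories, with the mean absolute advantage as a proxy reward and TD(0) value updates. The Python sketch below illustrates one way such a loop could be wired up; the class name, the softmax sampling rule, and all hyperparameters are assumptions made for illustration rather than the authors' implementation.

```python
import math
import random
from collections import defaultdict

class SECCurriculum:
    """Minimal sketch of a SEC-style curriculum bandit (illustrative only).

    Each problem category (e.g., a difficulty level) is one arm. The reward
    for an arm is a proxy for immediate learning gain: here, the mean absolute
    advantage observed on a batch drawn from that category. Arm values are
    updated with a TD(0)-style step toward the latest reward, and categories
    are sampled from a softmax over those values.
    """

    def __init__(self, categories, lr=0.1, temperature=1.0):
        self.categories = list(categories)
        self.lr = lr                    # TD(0) step size (assumed value)
        self.temperature = temperature  # exploration temperature (assumed)
        self.q = defaultdict(float)     # estimated learning gain per category

    def select(self):
        # Boltzmann sampling over current value estimates.
        logits = [self.q[c] / self.temperature for c in self.categories]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        return random.choices(self.categories, weights=weights, k=1)[0]

    def update(self, category, advantages):
        # TD(0)-style update toward the batch's mean absolute advantage.
        reward = sum(abs(a) for a in advantages) / max(len(advantages), 1)
        self.q[category] += self.lr * (reward - self.q[category])


# Hypothetical usage inside an RL fine-tuning loop:
curriculum = SECCurriculum(categories=["easy", "medium", "hard"])
for _ in range(3):
    cat = curriculum.select()
    # In practice these would be the advantages from the policy-gradient
    # update on a batch sampled from `cat`; random values stand in here.
    batch_advantages = [random.gauss(0.0, 1.0) for _ in range(8)]
    curriculum.update(cat, batch_advantages)
```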
Related papers
- DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training [15.74527731339671]
We present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our framework prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration); a minimal scoring sketch follows this entry. Our experiments show that our framework significantly improves convergence speed and final performance.
arXiv Detail & Related papers (2025-04-13T20:10:27Z)
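The DUMP summary above describes prioritizing distributions by either high average advantage (exploitation) or low sample count (exploration). One minimal way to combine the two is a UCB-style score, sketched below; the square-root bonus, the constant `c`, and the example statistics are assumptions, not the paper's actual rule.

```python
import math

def distribution_priority(mean_abs_advantage, sample_count, total_samples, c=1.0):
    """UCB-style priority for one training distribution (illustrative only).

    Exploitation: distributions with high average advantage score higher.
    Exploration: rarely sampled distributions receive a bonus that shrinks
    as their sample count grows.
    """
    bonus = c * math.sqrt(math.log(total_samples + 1) / (sample_count + 1))
    return mean_abs_advantage + bonus

# Pick the highest-priority distribution from hypothetical statistics:
# distribution -> (mean absolute advantage, times sampled so far).
stats = {"easy": (0.10, 500), "medium": (0.35, 200), "hard": (0.20, 20)}
total = sum(n for _, n in stats.values())
best = max(stats, key=lambda d: distribution_priority(stats[d][0], stats[d][1], total))
```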
- Causally Aligned Curriculum Learning [69.11672390876763]
This paper studies the problem of curriculum RL through causal lenses. We derive a sufficient graphical condition characterizing causally aligned source tasks. We develop an efficient algorithm to generate a causally aligned curriculum.
arXiv Detail & Related papers (2025-03-21T02:20:38Z)
- Learning to Reason at the Frontier of Learnability [1.688134675717698]
Reinforcement learning is now widely adopted as the final stage of large language model training. We demonstrate that many questions are either solved by all attempts (meaning they are already learned) or by none. To address this, we adapt a method from the reinforcement learning literature, sampling for learnability, and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always; a minimal weighting sketch follows this entry.
arXiv Detail & Related papers (2025-02-17T19:16:37Z)
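The learnability-sampling idea above, prioritizing questions the model sometimes solves but not always, can be captured by weighting each question by the variance of its empirical success rate, p(1 - p). The sketch below is an illustrative reading of that idea; the zero weight for unseen questions and the sampling step are assumptions.

```python
import random

def learnability_weight(successes, attempts):
    """Success-variance weight p * (1 - p) for prioritizing a question.

    Questions solved on every attempt (p = 1) or on none (p = 0) get zero
    weight; questions solved about half the time are weighted most.
    """
    if attempts == 0:
        return 0.0  # assumption: unseen questions get no weight in this sketch
    p = successes / attempts
    return p * (1.0 - p)

# Hypothetical sampling step: draw the next question in proportion to its weight.
history = {"q1": (8, 8), "q2": (3, 8), "q3": (0, 8)}  # question -> (successes, attempts)
weights = [learnability_weight(s, n) for s, n in history.values()]
next_question = random.choices(list(history), weights=weights, k=1)[0]
```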
- Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization [22.67700436936984]
We introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline reinforcement learning algorithm. DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy; a sketch of such step-level credit follows this entry. Our results show that DAPO can effectively enhance the mathematical and code capabilities of both SFT models and RL models, demonstrating its effectiveness.
arXiv Detail & Related papers (2024-12-24T08:39:35Z)
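The DAPO summary mentions a critic that predicts reasoning accuracy at each step to produce dense signals. One plausible reading is to credit each step by how much it changes the critic's predicted success probability, as in the sketch below; the function name and scoring convention are assumptions, not DAPO's actual objective.

```python
def step_advantages(step_scores):
    """Dense per-step credit from a critic's predicted accuracy (illustrative).

    `step_scores[t]` is an assumed critic estimate of the probability that the
    solution will end up correct after completing step t, with step_scores[0]
    the estimate before any reasoning step. Successive differences credit each
    step by how much it moved that estimate.
    """
    return [step_scores[t + 1] - step_scores[t] for t in range(len(step_scores) - 1)]

# Example: a four-step solution whose third step hurts the predicted accuracy.
print(step_advantages([0.30, 0.45, 0.60, 0.40, 0.90]))
# roughly [0.15, 0.15, -0.20, 0.50] (up to floating-point noise)
```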
- Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone [72.17534881026995]
We develop an offline and online fine-tuning approach called policy-agnostic RL (PA-RL). We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm.
arXiv Detail & Related papers (2024-12-09T17:28:03Z)
- Self-Supervised Curriculum Generation for Autonomous Reinforcement Learning without Task-Specific Knowledge [25.168236693829783]
A significant bottleneck in applying current reinforcement learning algorithms to real-world scenarios is the need to reset the environment between every episode.
We propose a novel autonomous reinforcement learning (ARL) algorithm that can generate a curriculum adaptive to the agent's learning progress without task-specific knowledge.
arXiv Detail & Related papers (2023-11-15T18:40:10Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization [66.35750142827898]
This paper presents the first systematic study on policy optimization methods for online CO problems.
We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds for natural policy gradient (NPG).
Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift.
arXiv Detail & Related papers (2022-02-11T03:17:15Z)
- Text Generation with Efficient (Soft) Q-Learning [91.47743595382758]
Reinforcement learning (RL) offers a more flexible solution by allowing users to plug in arbitrary task metrics as rewards.
We introduce a new RL formulation for text generation from the soft Q-learning perspective; a one-step target sketch follows this entry.
We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation.
arXiv Detail & Related papers (2021-06-14T18:48:40Z)
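The soft Q-learning formulation mentioned above treats a partial sequence as the state, the next token as the action, and the model's logits as Q-values. The sketch below shows a generic one-step soft Bellman target under those assumptions; it is a simplified illustration, not the paper's exact training objective.

```python
import math

def soft_q_target(reward, next_token_q_values, gamma=1.0, tau=1.0, terminal=False):
    """One-step soft Bellman target for text generation (illustrative).

    Bootstraps through a temperature-scaled log-sum-exp over the next step's
    Q-values (the model's logits). The per-step reward handling and the
    hyperparameters here are simplifying assumptions.
    """
    if terminal or not next_token_q_values:
        return reward
    # Numerically stable tau * logsumexp(q / tau).
    m = max(next_token_q_values)
    soft_value = m + tau * math.log(
        sum(math.exp((q - m) / tau) for q in next_token_q_values)
    )
    return reward + gamma * soft_value

# Example with a tiny three-token vocabulary at the next step.
target = soft_q_target(reward=0.0, next_token_q_values=[1.2, 0.3, -0.5])
```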