DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
- URL: http://arxiv.org/abs/2504.09710v1
- Date: Sun, 13 Apr 2025 20:10:27 GMT
- Title: DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
- Authors: Zhenting Wang, Guofeng Cui, Kun Wan, Wentian Zhao
- Abstract summary: We present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our framework prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration). Our experiments show that our framework significantly improves convergence speed and final performance.
- Score: 15.74527731339671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions, varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distributions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets spanning multiple difficulty levels and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.
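As a rough illustration of the scheduling idea described in the abstract, the sketch below maintains per-distribution advantage statistics and picks the next training distribution from UCB scores that combine average advantage magnitude (exploitation) with an inverse-count bonus (exploration). This is a minimal sketch under assumptions: the class name, the proportional-sampling rule, and the exploration coefficient `c` are illustrative choices, not the paper's exact formulation; see the linked repository for the authors' actual implementation.

```python
import math
import random
from collections import defaultdict


class DistributionCurriculum:
    """Hypothetical UCB-style scheduler over training data distributions."""

    def __init__(self, distribution_ids, c=1.0):
        self.ids = list(distribution_ids)
        self.c = c                           # exploration coefficient (assumed)
        self.counts = defaultdict(int)       # samples drawn per distribution
        self.adv_sums = defaultdict(float)   # running sum of |advantage|
        self.total = 0                       # total samples drawn overall

    def update(self, dist_id, abs_advantages):
        # Record the magnitude of policy advantages (e.g. GRPO group-relative
        # advantages) observed on a batch drawn from `dist_id`; larger
        # magnitudes suggest the model can still learn more from it.
        self.counts[dist_id] += len(abs_advantages)
        self.adv_sums[dist_id] += sum(abs_advantages)
        self.total += len(abs_advantages)

    def ucb_score(self, dist_id):
        n = self.counts[dist_id]
        if n == 0:
            return float("inf")              # force initial exploration
        exploit = self.adv_sums[dist_id] / n                       # high avg advantage
        explore = self.c * math.sqrt(math.log(self.total) / n)     # low sample count
        return exploit + explore

    def sample_distribution(self):
        # Draw the next training distribution in proportion to its UCB score
        # (an argmax over scores would be an equally plausible variant).
        scores = [self.ucb_score(d) for d in self.ids]
        unseen = [d for d, s in zip(self.ids, scores) if math.isinf(s)]
        if unseen:
            return random.choice(unseen)
        z = sum(scores)
        if z <= 0:
            return random.choice(self.ids)
        return random.choices(self.ids, weights=[s / z for s in scores])[0]
```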
Related papers
- Large Language Models as Attribution Regularizers for Efficient Model Training [0.0]
Large Language Models (LLMs) have demonstrated remarkable performance across diverse domains. We introduce a novel yet straightforward method for incorporating LLM-generated global task feature attributions into the training process of smaller networks. Our approach yields superior performance in few-shot learning scenarios.
arXiv Detail & Related papers (2025-02-27T16:55:18Z) - Escaping Collapse: The Strength of Weak Data for Large Language Model Training [15.77316232527746]
We develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves.
We describe a training procedure that converges to an optimal LLM even if almost all of the non-synthetic training data is of poor quality.
arXiv Detail & Related papers (2025-02-13T03:20:37Z) - A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs [74.35290684163718]
A primary challenge in large language model (LLM) development is their onerous pre-training cost.
This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by leveraging a small language model (SLM).
arXiv Detail & Related papers (2024-10-24T14:31:52Z) - Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
In deep Reinforcement Learning (RL) models trained using gradient-based techniques, the choice of gradient and its learning rate are crucial to achieving good performance. We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL), a meta-learning approach that selects the learning rate based on the agent's performance during training.
arXiv Detail & Related papers (2024-10-16T14:15:28Z) - Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate [118.37653302885607]
We present the Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of Large Vision Language Models (LVLMs).
MIR is informative for training data selection, training strategy scheduling, and model architecture design, helping achieve better pre-training results.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Generalizing Reward Modeling for Out-of-Distribution Preference Learning [3.9160947065896803]
Preference learning with large language models (LLMs) aims to align the LLMs' generations with human preferences.
Due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging.
This work addresses out-of-distribution preference learning (OOD PL) by optimizing a general reward model through a meta-learning approach.
arXiv Detail & Related papers (2024-02-22T18:20:33Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning [20.680417111485305]
We introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation.
The improved distributional estimates lend themselves to UCB-based exploration.
We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control.
arXiv Detail & Related papers (2022-02-06T03:27:05Z)