Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback
- URL: http://arxiv.org/abs/2505.20075v1
- Date: Mon, 26 May 2025 14:53:08 GMT
- Title: Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback
- Authors: Mengdi Li, Jiaye Lin, Xufeng Zhao, Wenhao Lu, Peilin Zhao, Stefan Wermter, Di Wang
- Abstract summary: This paper attempts to enhance the generalizability of reward models through a data-centric approach. We propose a novel framework, $\textit{Curriculum-RLAIF}$, which constructs preference pairs with varying difficulty levels. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability.
- Score: 36.919559767160415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward models trained with conventional Reinforcement Learning from AI Feedback (RLAIF) methods suffer from limited generalizability, which hinders the alignment performance of the policy model during reinforcement learning (RL). This challenge stems from various issues, including distribution shift, preference label noise, and mismatches between overly challenging samples and model capacity. In this paper, we attempt to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from the perspective of data difficulty. To address this, we propose a novel framework, $\textit{Curriculum-RLAIF}$, which constructs preference pairs with varying difficulty levels and produces a curriculum that progressively incorporates preference pairs of increasing difficulty for reward model training. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, increasing the alignment performance of the policy model by a large margin without incurring additional inference costs compared to various non-curriculum baselines. Detailed analysis and comparisons with alternative approaches, including data selection via external pretrained reward models or internal self-selection mechanisms, as well as other curriculum strategies, further demonstrate the superiority of our approach in terms of simplicity, efficiency, and effectiveness.
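A minimal sketch of the curriculum idea described in the abstract, assuming preference pairs arrive with a precomputed difficulty score (e.g. from annotator disagreement or a reward-margin estimate) and using a toy Bradley-Terry style loss. The `PreferencePair` class, `build_curriculum`, and the dummy reward function below are illustrative stand-ins, not the authors' implementation.

```python
# Sketch: curriculum-ordered reward-model training on preference pairs.
# Difficulty scoring, the reward function, and the loss are assumptions,
# not taken from the paper.
from dataclasses import dataclass
from typing import List
import math
import random


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    difficulty: float  # assumed precomputed (e.g. labeler disagreement, score margin)


def build_curriculum(pairs: List[PreferencePair], num_stages: int) -> List[List[PreferencePair]]:
    """Split preference pairs into stages of increasing difficulty."""
    ordered = sorted(pairs, key=lambda p: p.difficulty)
    stage_size = math.ceil(len(ordered) / num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


def train_reward_model(reward_fn, pairs, epochs_per_stage=1):
    """Toy loop with a Bradley-Terry style loss on (chosen, rejected) pairs.

    `reward_fn` is a placeholder scoring function; in practice it would be a
    trainable model and this loop would perform gradient updates.
    """
    for epoch in range(epochs_per_stage):
        random.shuffle(pairs)
        total_loss = 0.0
        for pair in pairs:
            margin = reward_fn(pair.prompt, pair.chosen) - reward_fn(pair.prompt, pair.rejected)
            total_loss += math.log(1.0 + math.exp(-margin))  # -log sigmoid(margin)
        print(f"  epoch {epoch}: mean loss {total_loss / max(len(pairs), 1):.4f}")


if __name__ == "__main__":
    # Dummy data and a dummy length-based reward, just to make the sketch runnable.
    data = [
        PreferencePair("q1", "good answer", "bad", difficulty=0.1),
        PreferencePair("q2", "detailed correct answer", "short wrong answer", difficulty=0.5),
        PreferencePair("q3", "subtly better answer", "subtly worse answer", difficulty=0.9),
    ]
    dummy_reward = lambda prompt, response: float(len(response))

    seen = []
    for stage_idx, stage in enumerate(build_curriculum(data, num_stages=3)):
        seen.extend(stage)  # progressively incorporate pairs of increasing difficulty
        print(f"stage {stage_idx}: training on {len(seen)} pairs")
        train_reward_model(dummy_reward, seen)
```

The progressive accumulation of stages mirrors the abstract's "curriculum that progressively incorporates preference pairs of increasing difficulty"; how difficulty is actually measured is a design choice the paper explores and is only stubbed here.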
Related papers
- Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning [43.12759195699103]
Large Language Models (LLMs) have achieved remarkable performance across various reasoning tasks, yet post-training is constrained by inefficient sample utilization and inflexible handling of sample difficulty. We propose Customized Curriculum Learning (CCL), a novel framework with two key innovations. First, we introduce a model-adaptive difficulty definition that customizes curriculum datasets based on each model's individual capabilities rather than using predefined difficulty metrics. Second, we develop "Guided Prompting," which dynamically reduces sample difficulty through strategic hints, enabling effective utilization of challenging samples that would otherwise degrade performance.
arXiv Detail & Related papers (2025-06-04T15:31:46Z) - Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage [3.7756107931620666]
We propose a novel training method that integrates bidirectional chains of thought and a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage.
arXiv Detail & Related papers (2025-05-13T02:05:25Z) - Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws [52.10468229008941]
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting. We provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. Building on these insights, we introduce a novel method for Contrastive Language-Image Pretraining with a reference model, termed DRRho-CLIP.
arXiv Detail & Related papers (2025-05-10T16:55:03Z) - Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data.
We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs.
Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z) - Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Pre-trained Recommender Systems: A Causal Debiasing Perspective [19.712997823535066]
We develop a generic recommender that captures universal interaction patterns by training on generic user-item interaction data extracted from different domains.
Our empirical studies show that the proposed model could significantly improve the recommendation performance in zero- and few-shot learning settings.
arXiv Detail & Related papers (2023-10-30T03:37:32Z) - Improving Generalization of Alignment with Human Preferences through Group Invariant Learning [56.19242260613749]
Reinforcement Learning from Human Feedback (RLHF) enables the generation of responses more aligned with human preferences.
Previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples.
We propose a novel approach that can learn a consistent policy via RL across various data groups or domains.
arXiv Detail & Related papers (2023-10-18T13:54:15Z) - Class-Incremental Mixture of Gaussians for Deep Continual Learning [15.49323098362628]
We propose end-to-end incorporation of the mixture of Gaussians model into the continual learning framework.
We show that our model can effectively learn in memory-free scenarios with fixed extractors.
arXiv Detail & Related papers (2023-07-09T04:33:19Z) - Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models [40.08137765886609]
We show that our model, called a graph structured surrogate model (GSSM), outperforms state-of-the-art methods in predicting environment dynamics.
Our approach is able to obtain high returns, while allowing fast execution during deployment by avoiding test time policy gradient optimization.
arXiv Detail & Related papers (2021-02-16T17:21:55Z) - Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z) - Progressive Multi-Stage Learning for Discriminative Tracking [25.94944743206374]
We propose a joint discriminative learning scheme with the progressive multi-stage optimization policy of sample selection for robust visual tracking.
The proposed scheme presents a novel time-weighted and detection-guided self-paced learning strategy for easy-to-hard sample selection.
Experiments on the benchmark datasets demonstrate the effectiveness of the proposed learning framework.
arXiv Detail & Related papers (2020-04-01T07:01:30Z)