Improving Generalization of Alignment with Human Preferences through
Group Invariant Learning
- URL: http://arxiv.org/abs/2310.11971v3
- Date: Tue, 26 Dec 2023 02:31:57 GMT
- Title: Improving Generalization of Alignment with Human Preferences through
Group Invariant Learning
- Authors: Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou,
Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) enables the generation of responses more aligned with human preferences.
Previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples.
We propose a novel approach that can learn a consistent policy via RL across various data groups or domains.
- Score: 56.19242260613749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of AI assistants based on large language models (LLMs) hinges crucially
on Reinforcement Learning from Human Feedback (RLHF), which enables the
generation of responses more aligned with human preferences. As universal AI
assistants, they are increasingly expected to perform consistently
across various domains. However, previous work shows that Reinforcement
Learning (RL) often exploits shortcuts to attain high rewards and overlooks
challenging samples. This focus on quick reward gains undermines both the
stability in training and the model's ability to generalize to new, unseen
data. In this work, we propose a novel approach that can learn a consistent
policy via RL across various data groups or domains. Given the challenges
associated with acquiring group annotations, our method automatically
classifies data into different groups, deliberately maximizing performance
variance. Then, we optimize the policy to perform well on challenging groups.
Lastly, leveraging the established groups, our approach adaptively adjusts the
exploration space, allocating more learning capacity to more challenging data
and preventing the model from over-optimizing on simpler data. Experimental
results indicate that our approach significantly enhances training stability
and model generalization.
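The abstract outlines three steps: automatically grouping data so that performance variance across groups is maximized, optimizing the policy to do well on the challenging groups, and adaptively adjusting the exploration space per group. The paper's implementation details are not reproduced here, so the sketch below is only a rough illustration of the first two ideas under stated assumptions: the quantile-based grouping rule, the softmax weighting, and all names (assign_groups, group_weights, n_groups, temperature) are invented for exposition and are not the authors' code; the adaptive exploration step is omitted.

```python
# Illustrative sketch only: the grouping rule, weighting scheme, and all names
# below are assumptions made for exposition, not the paper's implementation.
import numpy as np

def assign_groups(rewards: np.ndarray, n_groups: int = 4) -> np.ndarray:
    """Partition samples into groups with large between-group reward variance.
    Sorting by reward and splitting into quantile bins is one simple way to
    obtain high-variance contiguous groups."""
    order = np.argsort(rewards)
    group_ids = np.empty_like(order)
    for g, chunk in enumerate(np.array_split(order, n_groups)):
        group_ids[chunk] = g
    return group_ids

def group_weights(rewards: np.ndarray, group_ids: np.ndarray,
                  n_groups: int, temperature: float = 1.0) -> np.ndarray:
    """Give more weight to groups with low mean reward (the 'challenging'
    groups), in the spirit of distributionally robust / worst-group training."""
    means = np.array([rewards[group_ids == g].mean() for g in range(n_groups)])
    w = np.exp(-means / temperature)
    return w / w.sum()

# Toy usage: reweight per-sample policy losses by the weight of their group.
rewards = np.random.randn(64)   # per-response rewards from a reward model
losses = np.random.rand(64)     # stand-in for per-sample RL losses
gids = assign_groups(rewards, n_groups=4)
w = group_weights(rewards, gids, n_groups=4)
weighted_loss = (w[gids] * losses).sum() / w[gids].sum()
```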
Related papers
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
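The UAL entry above describes setting each training sample's label-smoothing value according to its uncertainty. A minimal sketch of that idea follows; the entropy-based uncertainty measure and the linear mapping from uncertainty to smoothing strength are assumptions for illustration, not necessarily the paper's choices.

```python
# Per-sample adaptive label smoothing driven by predictive uncertainty.
# The entropy-based uncertainty and the linear mapping to a smoothing value
# are illustrative assumptions, not necessarily the UAL paper's recipe.
import torch
import torch.nn.functional as F

def adaptive_label_smoothing_loss(logits: torch.Tensor, targets: torch.Tensor,
                                  max_smooth: float = 0.2) -> torch.Tensor:
    """Cross-entropy where each sample's smoothing value grows with the
    normalized entropy of the model's predictive distribution."""
    n_classes = logits.size(-1)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    uncertainty = entropy / torch.log(torch.tensor(float(n_classes)))  # in [0, 1]
    eps = max_smooth * uncertainty.detach()                            # per-sample smoothing

    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)  # cross-entropy against the uniform label
    return ((1.0 - eps) * nll + eps * uniform).mean()

# Usage: loss = adaptive_label_smoothing_loss(model(x), y)  # y: LongTensor of class ids
```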
- Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences.
Existing methods work by emulating the preferences at the single decision (turn) level.
We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
arXiv Detail & Related papers (2024-05-23T14:53:54Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
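The entry above formalizes hard-sample reweighting as Instance-Reweighted Distributionally Robust Optimization. A common closed form for KL-regularized instance-level DRO weights each example proportionally to exp(loss / tau); the sketch below uses that form as an illustrative assumption and is not taken from the paper.

```python
# Sketch of instance-reweighted DRO: upweight high-loss (hard) examples.
# The softmax(loss / tau) weighting is the usual KL-regularized DRO closed
# form, used here as an illustrative assumption rather than the paper's recipe.
import torch

def irdro_loss(per_sample_losses: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Reweighted objective: weights = softmax(loss / tau), so harder samples
    contribute more to the gradient."""
    weights = torch.softmax(per_sample_losses.detach() / tau, dim=0)
    return (weights * per_sample_losses).sum()

# Usage inside a continual-training step:
#   losses = F.cross_entropy(logits, labels, reduction="none")
#   loss = irdro_loss(losses, tau=0.5)
#   loss.backward()
```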
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
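The reward-modeling entry above centers on distinguishing chosen from rejected responses. The standard pairwise (Bradley-Terry style) reward loss used in RLHF is sketched below, with an optional margin term as one simple contrastive-style variant; the margin and all names are assumptions for illustration, not the paper's specific methods.

```python
# Pairwise reward-model loss: push the reward of the chosen response above the
# rejected one. The optional margin is an illustrative assumption only.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor,
                         margin: float = 0.0) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected - margin), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Usage: r_c, r_r = reward_model(chosen_ids), reward_model(rejected_ids)
#        loss = pairwise_reward_loss(r_c, r_r, margin=0.1)
```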
- Pre-trained Recommender Systems: A Causal Debiasing Perspective [19.712997823535066]
We develop a generic recommender that captures universal interaction patterns by training on generic user-item interaction data extracted from different domains.
Our empirical studies show that the proposed model could significantly improve the recommendation performance in zero- and few-shot learning settings.
arXiv Detail & Related papers (2023-10-30T03:37:32Z)
- COPR: Continual Learning Human Preference through Optimal Policy Regularization [32.54658750353585]
We propose a new method called Continual Optimal Policy Regularization (COPR).
COPR involves a single learning phase and doesn't necessitate complex reinforcement learning.
Our experimental results show that COPR outperforms strong Continual Learning (CL) baselines.
arXiv Detail & Related papers (2023-10-24T10:05:32Z)
- Enabling Language Models to Implicitly Learn Self-Improvement [49.16868302881804]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks.
We propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data.
arXiv Detail & Related papers (2023-10-02T04:29:40Z)
- Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning [10.00979536266327]
We present a novel approach to address the challenge of generalization in offline reinforcement learning (RL).
Specifically, we aim to improve the agent's ability to generalize to out-of-distribution goals.
We learn a new policy offline based on the augmented dataset, with an off-the-shelf offline RL algorithm.
arXiv Detail & Related papers (2023-09-14T10:22:33Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
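The SURF entry above infers pseudo-labels for unlabeled pairs from the confidence of the preference predictor. A minimal sketch of confidence-thresholded pseudo-labeling follows; the threshold value and function names are assumptions for illustration, not SURF's exact code.

```python
# Confidence-based pseudo-labeling for preference pairs, in the spirit of SURF.
# The threshold and all names are illustrative assumptions only.
import torch

def pseudo_label_preferences(p_first_preferred: torch.Tensor,
                             threshold: float = 0.9):
    """Given predicted probabilities that the first segment of each unlabeled
    pair is preferred, keep only confident pairs and assign hard labels."""
    confident = (p_first_preferred > threshold) | (p_first_preferred < 1 - threshold)
    labels = (p_first_preferred > 0.5).long()
    return confident, labels  # mask of pairs to keep, and their pseudo-labels

# Usage: mask, y = pseudo_label_preferences(predictor_probs)
#        add (pairs[mask], y[mask]) to the reward-learning batch
```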
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.