Scalable Ensembling For Mitigating Reward Overoptimisation
- URL: http://arxiv.org/abs/2406.01013v2
- Date: Tue, 18 Jun 2024 20:53:08 GMT
- Title: Scalable Ensembling For Mitigating Reward Overoptimisation
- Authors: Ahmed M. Ahmed, Rafael Rafailov, Stepan Sharkov, Xuechen Li, Sanmi Koyejo
- Abstract summary: Reinforcement Learning from Human Feedback has enabled significant advancements within language modeling for powerful, instruction-following models.
The alignment of these models remains a pressing challenge as the policy tends to overfit the learned "proxy" reward model past an inflection point of utility.
- Score: 24.58937616758007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has enabled significant advancements within language modeling for powerful, instruction-following models. However, the alignment of these models remains a pressing challenge as the policy tends to overfit the learned "proxy" reward model past an inflection point of utility as measured by a "gold" reward model that is more performant, a phenomenon known as overoptimisation. Prior work has mitigated this issue by computing a pessimistic statistic over an ensemble of reward models, which is common in Offline Reinforcement Learning but incredibly costly for language models with high memory requirements, making such approaches infeasible for sufficiently large models. To this end, we propose using a shared encoder but separate linear heads. We find this leads to similar performance as the full ensemble while allowing tremendous savings in memory and time required for training for models of similar size.
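As a rough illustration of the idea in the abstract, the sketch below computes one set of features with a shared encoder, scores them with several independent linear heads, and collapses the per-head rewards into a pessimistic value. Everything in it is an assumption made for illustration: the toy `shared_encoder` stands in for a language-model backbone, the heads are randomly initialised rather than trained, and the two aggregation rules (minimum, and mean minus a multiple of the standard deviation) are common choices for a pessimistic statistic, not necessarily the ones used in the paper.

```python
import math
import random
from statistics import mean, pstdev


def shared_encoder(text, dim=8):
    """Hypothetical stand-in for a shared LM encoder: hashed character features."""
    feats = [0.0] * dim
    for i, ch in enumerate(text):
        feats[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(f * f for f in feats)) or 1.0
    return [f / norm for f in feats]


def make_heads(num_heads, dim, seed=0):
    """Create independent linear heads (random weights here; trained in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_heads)]


def head_rewards(features, heads):
    """Score the same shared features with every head: one encoder pass, K rewards."""
    return [sum(w * f for w, f in zip(head, features)) for head in heads]


def pessimistic_reward(rewards, mode="min", penalty=1.0):
    """Collapse per-head rewards into a single conservative estimate."""
    if mode == "min":
        return min(rewards)
    if mode == "mean_minus_std":
        return mean(rewards) - penalty * pstdev(rewards)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    heads = make_heads(num_heads=4, dim=8)
    features = shared_encoder("an example response")
    rewards = head_rewards(features, heads)
    print("per-head rewards:", rewards)
    print("worst case:", pessimistic_reward(rewards, "min"))
    print("mean minus std:", pessimistic_reward(rewards, "mean_minus_std"))
```

The memory saving comes from the fact that only the small head weight vectors are duplicated per ensemble member, while the expensive encoder is stored and evaluated once.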
Related papers
- Effects of Scale on Language Model Robustness [7.725206196110384]
We show that adversarially trained larger models generalize faster and better to modified attacks not seen during training when compared with smaller models.
We also analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others.
arXiv Detail & Related papers (2024-07-25T17:26:41Z) - RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Multi-timestep models for Model-based Reinforcement Learning [10.940666275830052]
In model-based reinforcement learning (MBRL), most algorithms rely on simulating trajectories from one-step dynamics models learned on data.
We tackle this issue by using a multi-timestep objective to train one-step models.
We find that exponentially decaying weights lead to models that significantly improve the long-horizon R2 score.
arXiv Detail & Related papers (2023-10-09T12:42:39Z) - Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks [4.872984658007499]
Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models.
We revisit the line of implicit models and trace them back to the original weight-tied models.
Surprisingly, we observe that weight-tied models are more effective, stable, as well as efficient on vision tasks, compared to the DEQ variants.
arXiv Detail & Related papers (2023-07-16T11:45:35Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate there is no single model that works best for all the cases.
By choosing an appropriate bias model, we can obtain a better robustness result than baselines with a more sophisticated model design.
arXiv Detail & Related papers (2022-10-28T17:52:10Z) - Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z) - When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models with both higher accuracy and requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z) - Modeling Survival in model-based Reinforcement Learning [0.0]
This work presents the notion of survival by discussing cases in which the agent's goal is to survive.
A substitute model for the reward function approximator is introduced that learns to avoid terminal states.
Focusing on terminal states, as a small fraction of state-space, reduces the training effort drastically.
arXiv Detail & Related papers (2020-04-18T15:49:11Z)