Mitigating the Alignment Tax of RLHF
- URL: http://arxiv.org/abs/2309.06256v4
- Date: Sun, 13 Oct 2024 19:27:20 GMT
- Title: Mitigating the Alignment Tax of RLHF
- Authors: Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, Tong Zhang
- Abstract summary: Aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting of pretrained abilities, also known as the alignment tax.
We propose Heterogeneous Model Averaging (HMA) to maximize alignment performance while incurring minimal alignment tax.
We validate HMA's performance across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B.
- Score: 76.4300447532456
- License:
- Abstract: LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting of pretrained abilities, also known as the alignment tax. To investigate the alignment tax, we conducted experiments with existing RLHF algorithms on OpenLLaMA-3B, which revealed a pronounced alignment tax on NLP tasks. Moreover, although various techniques exist to mitigate forgetting, they are often at odds with RLHF performance, leading to a trade-off between alignment performance and forgetting mitigation: the alignment-forgetting trade-off. In this paper, we show that model averaging, which simply interpolates between pre- and post-RLHF model weights, surprisingly achieves the strongest alignment-forgetting Pareto front among a wide range of competing methods. To understand its effectiveness, we offer theoretical insights into model averaging, revealing that it enhances the performance Pareto front by increasing feature diversity in the layers where tasks share overlapping feature spaces. Empirical evidence corroborates our analysis by showing the benefits of averaging low-level transformer layers. Building on this analysis and the observation that averaging different transformer layers leads to significantly different alignment-forgetting trade-offs, we propose Heterogeneous Model Averaging (HMA), which searches for heterogeneous, layer-wise combination ratios. HMA seeks to maximize alignment performance while incurring minimal alignment tax. Moreover, we validate HMA's performance across a range of RLHF algorithms on OpenLLaMA-3B and further extend our findings to Mistral-7B, evaluated with an open-source preference model and GPT-4. Code is available at https://github.com/avalonstrel/Mitigating-the-Alignment-Tax-of-RLHF.git.
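As a rough illustration of the weight-interpolation idea described in the abstract, here is a minimal sketch, not the authors' released implementation; the helper names, the LLaMA-style parameter naming, the 26-layer count, and the per-layer ratio schedule are illustrative assumptions:

```python
# Sketch of interpolating pre-RLHF (SFT) and post-RLHF weights.
# NOT the authors' released implementation; helper names, LLaMA-style
# parameter naming, and the per-layer ratio schedule are assumptions.
from typing import Callable, Dict

import torch


def average_models(sft_state: Dict[str, torch.Tensor],
                   rlhf_state: Dict[str, torch.Tensor],
                   ratio_fn: Callable[[str], float]) -> Dict[str, torch.Tensor]:
    """Return w = (1 - a) * w_sft + a * w_rlhf for every parameter,
    where the ratio `a` may depend on the parameter name (e.g. its layer)."""
    merged = {}
    for name, w_sft in sft_state.items():
        a = ratio_fn(name)
        merged[name] = (1.0 - a) * w_sft + a * rlhf_state[name].to(w_sft.dtype)
    return merged


# Vanilla model averaging: a single global ratio for every parameter.
def uniform_ratio(name: str) -> float:
    return 0.5


# HMA-style heterogeneous averaging (illustrative schedule): keep lower
# transformer layers closer to the pre-RLHF weights and let higher layers
# follow the RLHF weights more closely.
def layerwise_ratio(name: str, num_layers: int = 26) -> float:
    for i in range(num_layers):
        if f"layers.{i}." in name:  # LLaMA-style parameter names
            return 0.2 + 0.6 * i / (num_layers - 1)
    return 0.5  # embeddings, norms, lm_head, etc.
```

A merged checkpoint would then be loaded with model.load_state_dict(average_models(sft.state_dict(), rlhf.state_dict(), layerwise_ratio)); HMA proper searches for the per-layer ratios rather than fixing them by hand as in this sketch.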
Related papers
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z)
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF) to jointly train the reward model and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques [63.10251271444959]
Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences.
We conduct an in-depth investigation of the impact of popular choices for three crucial axes.
Our setup spanning over 300 experiments reveals consistent trends and unexpected findings.
arXiv Detail & Related papers (2024-06-07T12:25:51Z)
- Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment [47.682736928029996]
Large Language Models (LLMs) are designed to align with human-centric values while preventing the degradation of abilities acquired through pre-training and supervised fine-tuning (SFT).
In this paper, we show that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax.
The proposed online merging optimizer significantly enhances alignment reward while mitigating the alignment tax, achieving higher overall performance across 14 benchmarks.
arXiv Detail & Related papers (2024-05-28T07:53:40Z)
- On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization [33.331389392270665]
Preference matching (PM) RLHF is a novel approach that aligns large language models with the preference distribution of the reward model under the Bradley-Terry-Luce/Plackett-Luce model.
Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses.
For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation.
arXiv Detail & Related papers (2024-05-26T07:00:05Z)
- Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation [32.371755315509574]
Householder reflection adaptation (HRA) is a simple but effective adaptation method based on Householder reflections.
HRA achieves superior performance with fewer learnable parameters when adapting large language models and conditional image generators.
arXiv Detail & Related papers (2024-05-24T16:18:16Z)
- Understanding the Effects of RLHF on LLM Generalisation and Diversity [26.56388427640671]
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date.
We present an analysis of how each stage of the process affects two key properties: out-of-distribution (OOD) generalisation and output diversity.
arXiv Detail & Related papers (2023-10-10T09:25:44Z)
- Supervised Hyperalignment for multi-subject fMRI data alignment [81.8694682249097]
This paper proposes a Supervised Hyperalignment (SHA) method to ensure better functional alignment for MVP analysis.
Experiments on multi-subject datasets demonstrate that the SHA method achieves up to 19% better performance on multi-class problems.
arXiv Detail & Related papers (2020-01-09T09:17:49Z)