Related papers: Mitigating the Alignment Tax of RLHF

Mitigating the Alignment Tax of RLHF

URL: http://arxiv.org/abs/2309.06256v3
Date: Mon, 5 Feb 2024 06:43:17 GMT
Title: Mitigating the Alignment Tax of RLHF
Authors: Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, Tong Zhang
Abstract summary: Reinforcement Learning with Human Feedback (RLHF) can lead to, which is also known as the alignment tax. We propose model averaging, which interpolates between pre and post RLHF model weights, to achieve a more efficient reward-tax front.
Score: 77.7879015461373
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: LLMs acquire a wide range of abilities during pre-training, but aligning LLMs under Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting, which is also known as the alignment tax. To empirically verify this hypothesis, we conducted experiments with existing RLHF algorithms using OpenLLaMA-3B, which revealed a pronounced alignment tax in NLP tasks. On the other hand, despite various techniques to mitigate forgetting, they are often at odds with the RLHF performance, leading to a trade-off between reward maximization and forgetting mitigation. In light of the above pressing issue in aligning LLMs, in this paper we explore model averaging, which interpolates between pre and post RLHF model weights, to achieve a more efficient reward-tax Pareto front. To understand its effectiveness, We offer theoretical insights into model averaging, revealing that it enhances performance Pareto front by increasing feature diversity on the layers where tasks share overlapped feature spaces. Empirical evidence corroborates our analysis by showing the benefits of averaging low-level transformer layers. Building on the analysis and the observation that averaging different layers of the transformer leads to significantly different reward-tax trade-offs, we propose Adaptive Model Averaging (AMA) to adaptively find various combination ratios of model layers. AMA seeks to maximize the alignment reward while incurring minimal alignment tax. Moreover, we validate AMA's performance across a range of RLHF algorithms over OpenLLaMA-3B and further extend our findings to Mistral-7B.

Related papers

The Hidden Link Between RLHF and Contrastive Learning [24.828596020853727]
We show that Reinforcement Learning from Human Feedback and Direct Preference Optimization can be interpreted from the perspective of mutual information.<n>Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning.<n>Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization.
arXiv Detail & Related papers (2025-06-27T18:51:25Z)
Aligning to What? Limits to RLHF Based Alignment [2.624902795082451]
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to align large language models with human preferences. This study investigates the relationship between RLHF and both covert and overt biases in large language models.
arXiv Detail & Related papers (2025-03-12T03:24:44Z)
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent [57.622821649679786]
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. In this paper, we drop the Bradley-Terry (BT) model assumption and study LLM alignment under general preferences, formulated as a two-player game. We show that our approach achieves an $O(T-1)$ bound on the duality gap, improving upon the previous $O(T-1/2)$ result.
arXiv Detail & Related papers (2025-02-24T05:24:52Z)
Simplify RLHF as Reward-Weighted SFT: A Variational Method [34.222095430239555]
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. We propose a novel simplification of RLHF from the perspective of variational inference. We transform the alignment objective into a reward-driven supervised fine-tuning form to obtain noticeable improvement on training stability and effectiveness.
arXiv Detail & Related papers (2025-02-16T07:22:00Z)
Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits [59.30310692855397]
We propose a unified framework for the RLHF pipeline from the view of contextual bandits. We decompose the RLHF process into two distinct stages: (post-)training and deployment. We then develop novel algorithms for each stage, demonstrating significant improvements in both statistical and computational efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences. Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation. Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z)
Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques [63.10251271444959]
Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences. We conduct an in-depth investigation of the impact of popular choices for three crucial axes. Our setup spanning over 300 experiments reveals consistent trends and unexpected findings.
arXiv Detail & Related papers (2024-06-07T12:25:51Z)
Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment [47.682736928029996]
Large Language Models (LLMs) are designed to align with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) In this paper, we show that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.
arXiv Detail & Related papers (2024-05-28T07:53:40Z)
On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization [33.331389392270665]
preference matching (PM) RLHF is a novel approach that aligns large language models with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation.
arXiv Detail & Related papers (2024-05-26T07:00:05Z)
Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation [32.371755315509574]
Householder reflection adaptation (HRA) is a simple but effective adaptation method based on Householder reflections. HRA achieves superior performance with fewer learnable parameters when adapting large language models and conditional image generators.
arXiv Detail & Related papers (2024-05-24T16:18:16Z)
Understanding the Effects of RLHF on LLM Generalisation and Diversity [26.56388427640671]
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date. We present an analysis of how each stage of the process affects two key properties: out-of-distribution (OOD) generalisation and output diversity.
arXiv Detail & Related papers (2023-10-10T09:25:44Z)
Supervised Hyperalignment for multi-subject fMRI data alignment [81.8694682249097]
This paper proposes a Supervised Hyperalignment (SHA) method to ensure better functional alignment for MVP analysis. Experiments on multi-subject datasets demonstrate that SHA method achieves up to 19% better performance for multi-class problems.
arXiv Detail & Related papers (2020-01-09T09:17:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.