Exploring the impact of low-rank adaptation on the performance,
efficiency, and regularization of RLHF
- URL: http://arxiv.org/abs/2309.09055v1
- Date: Sat, 16 Sep 2023 17:31:36 GMT
- Title: Exploring the impact of low-rank adaptation on the performance,
efficiency, and regularization of RLHF
- Authors: Simeng Sun, Dhawal Gupta, Mohit Iyyer
- Abstract summary: We investigate an efficient implementation of RLHF using low-rank adaptation (LoRA)
Our implementation achieves better performance than the publicly-released AlpacaFarm checkpoint with full model fine-tuning.
We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
- Score: 47.960563851948514
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During the last stage of RLHF, a large language model is aligned to human
intents via PPO training, a process that generally requires large-scale
computational resources. In this technical report, we empirically investigate
an efficient implementation of RLHF using low-rank adaptation (LoRA), which
allows us to align the LLaMA 7B checkpoint on the Alpaca dataset using only two
A100 GPUs instead of the eight required for full model fine-tuning. Despite
tuning only 0.2% of LLaMA 7B's parameters, our implementation achieves better
performance than the publicly-released AlpacaFarm checkpoint with full model
fine-tuning. Next, we analyze several configurations of our LoRA-based PPO
implementation, varying the form of the KL regularization term in the training
objective. We find that (1) removing this penalty term does not harm
performance on the AlpacaFarm evaluation set under our LoRA setup; (2) other
regularizers, such as Jensen-Shannon divergence, lead to improved performance;
and (3) while PPO training negatively impacts the factuality of model-generated
responses, training with LoRA largely mitigates this effect. We release our
code and pretrained checkpoints to facilitate future research on more efficient
RLHF.
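The KL-ablation experiments described above vary only the per-token divergence penalty that is subtracted from the PPO reward. Below is a minimal sketch of such a switchable penalty, not the authors' released code; the function name, the full-vocabulary formulation, and the beta coefficient are our assumptions.

```python
import torch
import torch.nn.functional as F

def divergence_penalty(policy_logits, ref_logits, kind="kl", beta=0.1):
    """Per-token penalty subtracted from the PPO reward.

    kind selects the regularizer studied in the paper: "none" (penalty
    removed), "kl" (standard RLHF KL penalty), or "js" (Jensen-Shannon).
    Names, shapes, and beta here are illustrative assumptions.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    p = logp.exp()
    if kind == "none":
        return torch.zeros(policy_logits.shape[:-1], device=policy_logits.device)
    if kind == "kl":
        # KL(p || q) summed over the vocabulary at each token position.
        return beta * (p * (logp - logq)).sum(-1)
    if kind == "js":
        # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
        q = logq.exp()
        logm = torch.log((0.5 * (p + q)).clamp_min(1e-12))
        kl_pm = (p * (logp - logm)).sum(-1)
        kl_qm = (q * (logq - logm)).sum(-1)
        return beta * 0.5 * (kl_pm + kl_qm)
    raise ValueError(f"unknown regularizer: {kind}")
```

In a typical RLHF loop this penalty is subtracted from the sequence-level reward before the PPO update; setting kind="none" reproduces the penalty-removal ablation.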
Related papers
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel algorithm, iterative Nash policy optimization (INPO)
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- Bootstrapping Language Models with DPO Implicit Rewards [45.68366127605774]
Direct preference optimization (DPO) has greatly simplified alignment compared to earlier reinforcement learning from human feedback pipelines.
In this work, we make the novel observation that the implicit reward model produced by DPO training can itself be used in a bootstrapping fashion to further align the LLM.
Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance.
arXiv Detail & Related papers (2024-06-14T06:57:18Z)
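The implicit reward that DICE bootstraps from is the standard DPO quantity beta * (log pi(y|x) - log pi_ref(y|x)). A hedged sketch of scoring sampled responses with it; function names and the beta value are ours:

```python
import torch

def dpo_implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """DPO's implicit reward for a response y given prompt x:
    beta * (log pi(y|x) - log pi_ref(y|x)), computed from per-token
    log-probs of the response tokens. beta's value is an assumption."""
    return beta * (policy_token_logps.sum(-1) - ref_token_logps.sum(-1))

# Bootstrapping, schematically: sample several responses per prompt, score
# them with dpo_implicit_reward, and form new preference pairs from the
# highest- and lowest-scoring responses for another round of DPO.
```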
- ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models [8.251547772610301]
We extend the methodology of low-rank adaptation (LoRA) to an innovative approach we call allocating low-rank adaptation (ALoRA).
First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank.
Second, guided by AB-LoRA, we gradually prune redundant or negatively impacting LoRA ranks and reallocate the pruned LoRA budget to important Transformer modules that need higher ranks.
arXiv Detail & Related papers (2024-03-24T15:09:55Z)
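The entry does not spell out AB-LoRA's scoring rule; one plausible ablation-style reading, offered purely as an illustration and not necessarily the paper's exact procedure, is to zero each LoRA rank in turn and measure the validation-loss increase:

```python
import torch

@torch.no_grad()
def rank_importance(lora_A, eval_loss_fn):
    """Illustrative ablation-based importance score for each LoRA rank
    (our reading, not necessarily AB-LoRA's method). lora_A is the
    (rank, in_features) LoRA factor; eval_loss_fn is a closure that
    evaluates validation loss with the current weights."""
    base_loss = eval_loss_fn()
    scores = []
    for r in range(lora_A.shape[0]):
        saved = lora_A[r].clone()
        lora_A[r].zero_()                          # ablate rank r of the update B @ A
        scores.append(eval_loss_fn() - base_loss)  # larger increase = more important
        lora_A[r].copy_(saved)
    return torch.tensor(scores)

# Ranks with low (or negative) scores are candidates for pruning; their
# budget can then be reassigned to modules whose ranks all score highly.
```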
- PERL: Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Large Language Models with human preferences.
We study RLHF where the underlying models are trained using the parameter-efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al.
We find that PERL performs on par with the conventional RLHF setting while training faster and using less memory.
arXiv Detail & Related papers (2024-03-15T21:43:46Z)
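A setup like PERL's can be approximated with the Hugging Face peft library, wrapping the policy (and, analogously, the reward model) with LoRA adapters. The checkpoint name, target modules, and hyperparameters below are assumptions for illustration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap a base LM so that RLHF updates only the LoRA adapter weights.
# All hyperparameters and the checkpoint name are illustrative choices.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
policy = get_peft_model(base, lora_cfg)
policy.print_trainable_parameters()  # typically well under 1% of all parameters
```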
- PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation [65.268245109828]
We introduce PRILoRA, which linearly allocates a different rank to each layer, increasing with depth, and performs pruning throughout the training process.
We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.
arXiv Detail & Related papers (2024-01-20T20:25:17Z)
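The linearly increasing per-layer allocation described in this entry can be written in a few lines; the rank bounds below are illustrative, not the paper's settings:

```python
def linear_rank_schedule(num_layers, r_min=4, r_max=16):
    """Assign each layer a LoRA rank that grows linearly with depth,
    in the spirit of PRILoRA's allocation. r_min/r_max are assumptions."""
    if num_layers == 1:
        return [r_max]
    step = (r_max - r_min) / (num_layers - 1)
    return [round(r_min + i * step) for i in range(num_layers)]

# linear_rank_schedule(12) -> [4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16]
```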
- Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA), which enables dynamic adjustment of the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it at a higher rank, while efficiently taming the temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z)
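A sketch of the SoRA mechanism as we read the entry above: a learnable gate vector sits between the two LoRA factors and is driven to exact zeros with a proximal (soft-thresholding) step, shrinking the effective rank. Shapes and the update rule are our assumptions:

```python
import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    """Our sketch of the SoRA idea: a gate g between the LoRA factors,
    sparsified during training so the effective rank shrinks.
    Shapes follow common LoRA conventions; details are assumptions."""
    def __init__(self, in_f, out_f, rank=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.g = nn.Parameter(torch.ones(rank))      # sparsifiable gate

    def forward(self, x):
        return (x @ self.A.T) * self.g @ self.B.T    # gate scales each rank

    @torch.no_grad()
    def prox_step(self, lam):
        # Proximal (soft-threshold) update pushing gate entries to exact zero.
        self.g.copy_(torch.sign(self.g) * (self.g.abs() - lam).clamp_min(0))
```

Calling prox_step after each optimizer step zeroes small gate entries, so the adapter starts at a high rank and sheds unneeded ranks as training proceeds.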
- Mitigating the Alignment Tax of RLHF [77.7879015461373]
Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting of abilities acquired during pretraining, a phenomenon known as the alignment tax.
We propose model averaging, which interpolates between pre- and post-RLHF model weights, to achieve a more efficient trade-off between reward and alignment tax.
arXiv Detail & Related papers (2023-09-12T14:16:54Z)
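Model averaging as described in the entry above is a one-line interpolation over state dicts; the mixing coefficient below is an arbitrary illustration:

```python
def average_weights(pre_rlhf_state, post_rlhf_state, alpha=0.5):
    """Interpolate between pre- and post-RLHF weights: alpha=1 recovers
    the RLHF model, alpha=0 the pre-RLHF model. The default alpha is an
    arbitrary illustration, not a recommended setting."""
    return {k: (1 - alpha) * pre_rlhf_state[k] + alpha * post_rlhf_state[k]
            for k in pre_rlhf_state}

# model.load_state_dict(average_weights(sft.state_dict(), rlhf.state_dict(), 0.3))
```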
- Efficient RLHF: Reducing the Memory Usage of PPO [61.45357428856269]
We present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO.
We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training.
Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
arXiv Detail & Related papers (2023-09-01T22:57:20Z)
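One way to realize "turning LoRA off" is a LoRA layer with a runtime switch, so the frozen base weights double as the reference model without keeping a second copy in memory. This is our sketch, not Hydra-RLHF's actual code:

```python
import torch
import torch.nn as nn

class ToggleableLoRALinear(nn.Module):
    """A frozen base linear layer plus a LoRA branch that can be switched
    off at runtime, in the spirit of Hydra-RLHF's dynamic LoRA "off" mode.
    This is an illustrative sketch; the paper's mechanism may differ."""
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.lora_enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.lora_enabled:                    # skip the branch when "off"
            out = out + (x @ self.A.T) @ self.B.T
        return out

# Turning LoRA off recovers the base/reference behavior in place:
# layer.lora_enabled = False
```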