Exploring the impact of low-rank adaptation on the performance,
efficiency, and regularization of RLHF
- URL: http://arxiv.org/abs/2309.09055v1
- Date: Sat, 16 Sep 2023 17:31:36 GMT
- Title: Exploring the impact of low-rank adaptation on the performance,
efficiency, and regularization of RLHF
- Authors: Simeng Sun, Dhawal Gupta, Mohit Iyyer
- Abstract summary: We investigate an efficient implementation of RLHF using low-rank adaptation (LoRA)
Our implementation achieves better performance than the publicly-released AlpacaFarm checkpoint with full model fine-tuning.
We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
- Score: 47.960563851948514
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During the last stage of RLHF, a large language model is aligned to human
intents via PPO training, a process that generally requires large-scale
computational resources. In this technical report, we empirically investigate
an efficient implementation of RLHF using low-rank adaptation (LoRA), which
allows us to align the LLaMA 7B checkpoint on the Alpaca dataset using only two
A100 GPUs instead of the eight required for full model fine-tuning. Despite
tuning only 0.2% of LLaMA 7B's parameters, our implementation achieves better
performance than the publicly-released AlpacaFarm checkpoint with full model
fine-tuning. Next, we analyze several configurations of our LoRA-based PPO
implementation, varying the form of the KL regularization term in the training
objective. We find that (1) removing this penalty term does not harm
performance on the AlpacaFarm evaluation set under our LoRA setup; (2) other
regularizers, such as Jensen-Shannon divergence, lead to improved performance;
and (3) while PPO training negatively impacts the factuality of model-generated
responses, training with LoRA largely mitigates this effect. We release our
code and pretrained checkpoints to facilitate future research on more efficient
RLHF.
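The KL-ablation experiments described above vary only the per-token divergence penalty that is subtracted from the PPO reward. Below is a minimal sketch of such a switchable penalty, not the authors' released code; the function name, the full-vocabulary formulation, and the beta coefficient are our assumptions.

```python
import torch
import torch.nn.functional as F

def divergence_penalty(policy_logits, ref_logits, kind="kl", beta=0.1):
    """Per-token penalty subtracted from the PPO reward.

    kind selects the regularizer studied in the paper: "none" (penalty
    removed), "kl" (standard RLHF KL penalty), or "js" (Jensen-Shannon).
    Names, shapes, and beta here are illustrative assumptions.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    p = logp.exp()
    if kind == "none":
        return torch.zeros(policy_logits.shape[:-1], device=policy_logits.device)
    if kind == "kl":
        # KL(p || q) summed over the vocabulary at each token position.
        return beta * (p * (logp - logq)).sum(-1)
    if kind == "js":
        # JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
        q = logq.exp()
        logm = torch.log((0.5 * (p + q)).clamp_min(1e-12))
        kl_pm = (p * (logp - logm)).sum(-1)
        kl_qm = (q * (logq - logm)).sum(-1)
        return beta * 0.5 * (kl_pm + kl_qm)
    raise ValueError(f"unknown regularizer: {kind}")
```

In a typical RLHF loop this penalty is subtracted from the sequence-level reward before the PPO update; setting kind="none" reproduces the penalty-removal ablation.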
Related papers
- Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel algorithm, iterative Nash policy optimization (INPO)
Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses.
With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z)
- Bootstrapping Language Models with DPO Implicit Rewards [45.68366127605774]
Direct preference optimization (DPO) has greatly simplified alignment compared to earlier reinforcement learning from human feedback pipelines.
In this work, we make the novel observation that the implicit reward model produced by DPO training can itself be used in a bootstrapping fashion to further align the LLM.
Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance.
arXiv Detail & Related papers (2024-06-14T06:57:18Z)
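The implicit reward that DICE bootstraps from is the standard DPO quantity beta * (log pi(y|x) - log pi_ref(y|x)). A hedged sketch of scoring sampled responses with it; function names and the beta value are ours:

```python
import torch

def dpo_implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """DPO's implicit reward for a response y given prompt x:
    beta * (log pi(y|x) - log pi_ref(y|x)), computed from per-token
    log-probs of the response tokens. beta's value is an assumption."""
    return beta * (policy_token_logps.sum(-1) - ref_token_logps.sum(-1))

# Bootstrapping, schematically: sample several responses per prompt, score
# them with dpo_implicit_reward, and form new preference pairs from the
# highest- and lowest-scoring responses for another round of DPO.
```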
- ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models [8.251547772610301]
We extend the methodology of low-rank adaptation (LoRA) to an innovative approach we call allocating low-rank adaptation (ALoRA).
First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank.
Second, guided by AB-LoRA, we gradually prune redundant or negatively impacting LoRA ranks and reallocate the pruned LoRA budget to important Transformer modules that need higher ranks.
arXiv Detail & Related papers (2024-03-24T15:09:55Z)
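The entry does not spell out AB-LoRA's scoring rule; one plausible ablation-style reading, offered purely as an illustration and not necessarily the paper's exact procedure, is to zero each LoRA rank in turn and measure the validation-loss increase:

```python
import torch

@torch.no_grad()
def rank_importance(lora_A, eval_loss_fn):
    """Illustrative ablation-based importance score for each LoRA rank
    (our reading, not necessarily AB-LoRA's method). lora_A is the
    (rank, in_features) LoRA factor; eval_loss_fn is a closure that
    evaluates validation loss with the current weights."""
    base_loss = eval_loss_fn()
    scores = []
    for r in range(lora_A.shape[0]):
        saved = lora_A[r].clone()
        lora_A[r].zero_()                          # ablate rank r of the update B @ A
        scores.append(eval_loss_fn() - base_loss)  # larger increase = more important
        lora_A[r].copy_(saved)
    return torch.tensor(scores)

# Ranks with low (or negative) scores are candidates for pruning; their
# budget can then be reassigned to modules whose ranks all score highly.
```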
- PERL: Parameter Efficient Reinforcement Learning from Human Feedback [27.687265760622918]
Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Large Language Models with human preferences.
We study RLHF where the underlying models are trained using the parameter-efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al.
We find that PERL performs on par with the conventional RLHF setting while training faster and using less memory.
arXiv Detail & Related papers (2024-03-15T21:43:46Z)
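A setup like PERL's can be approximated with the Hugging Face peft library, wrapping the policy (and, analogously, the reward model) with LoRA adapters. The checkpoint name, target modules, and hyperparameters below are assumptions for illustration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap a base LM so that RLHF updates only the LoRA adapter weights.
# All hyperparameters and the checkpoint name are illustrative choices.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
policy = get_peft_model(base, lora_cfg)
policy.print_trainable_parameters()  # typically well under 1% of all parameters
```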
- PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation [65.268245109828]
We introduce PRILoRA, which linearly allocates a different rank to each layer, increasing with depth, and performs pruning throughout the training process.
We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.
arXiv Detail & Related papers (2024-01-20T20:25:17Z)
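The linearly increasing per-layer allocation described in this entry can be written in a few lines; the rank bounds below are illustrative, not the paper's settings:

```python
def linear_rank_schedule(num_layers, r_min=4, r_max=16):
    """Assign each layer a LoRA rank that grows linearly with depth,
    in the spirit of PRILoRA's allocation. r_min/r_max are assumptions."""
    if num_layers == 1:
        return [r_max]
    step = (r_max - r_min) / (num_layers - 1)
    return [round(r_min + i * step) for i in range(num_layers)]

# linear_rank_schedule(12) -> [4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16]
```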
- Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA), which enables dynamic adjustment of the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it at a higher rank, while efficiently taming the temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z)
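A sketch of the SoRA mechanism as we read the entry above: a learnable gate vector sits between the two LoRA factors and is driven to exact zeros with a proximal (soft-thresholding) step, shrinking the effective rank. Shapes and the update rule are our assumptions:

```python
import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    """Our sketch of the SoRA idea: a gate g between the LoRA factors,
    sparsified during training so the effective rank shrinks.
    Shapes follow common LoRA conventions; details are assumptions."""
    def __init__(self, in_f, out_f, rank=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.g = nn.Parameter(torch.ones(rank))      # sparsifiable gate

    def forward(self, x):
        return (x @ self.A.T) * self.g @ self.B.T    # gate scales each rank

    @torch.no_grad()
    def prox_step(self, lam):
        # Proximal (soft-threshold) update pushing gate entries to exact zero.
        self.g.copy_(torch.sign(self.g) * (self.g.abs() - lam).clamp_min(0))
```

Calling prox_step after each optimizer step zeroes small gate entries, so the adapter starts at a high rank and sheds unneeded ranks as training proceeds.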
- Mitigating the Alignment Tax of RLHF [77.7879015461373]
Reinforcement Learning with Human Feedback (RLHF) can lead to forgetting of abilities acquired during pretraining, a phenomenon known as the alignment tax.
We propose model averaging, which interpolates between pre- and post-RLHF model weights, to achieve a more efficient trade-off between reward and alignment tax.
arXiv Detail & Related papers (2023-09-12T14:16:54Z)
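Model averaging as described in the entry above is a one-line interpolation over state dicts; the mixing coefficient below is an arbitrary illustration:

```python
def average_weights(pre_rlhf_state, post_rlhf_state, alpha=0.5):
    """Interpolate between pre- and post-RLHF weights: alpha=1 recovers
    the RLHF model, alpha=0 the pre-RLHF model. The default alpha is an
    arbitrary illustration, not a recommended setting."""
    return {k: (1 - alpha) * pre_rlhf_state[k] + alpha * post_rlhf_state[k]
            for k in pre_rlhf_state}

# model.load_state_dict(average_weights(sft.state_dict(), rlhf.state_dict(), 0.3))
```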
- Efficient RLHF: Reducing the Memory Usage of PPO [61.45357428856269]
We present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO.
We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training.
Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
arXiv Detail & Related papers (2023-09-01T22:57:20Z)
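One way to realize "turning LoRA off" is a LoRA layer with a runtime switch, so the frozen base weights double as the reference model without keeping a second copy in memory. This is our sketch, not Hydra-RLHF's actual code:

```python
import torch
import torch.nn as nn

class ToggleableLoRALinear(nn.Module):
    """A frozen base linear layer plus a LoRA branch that can be switched
    off at runtime, in the spirit of Hydra-RLHF's dynamic LoRA "off" mode.
    This is an illustrative sketch; the paper's mechanism may differ."""
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.lora_enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.lora_enabled:                    # skip the branch when "off"
            out = out + (x @ self.A.T) @ self.B.T
        return out

# Turning LoRA off recovers the base/reference behavior in place:
# layer.lora_enabled = False
```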