Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy
- URL: http://arxiv.org/abs/2403.04283v1
- Date: Thu, 7 Mar 2024 07:31:00 GMT
- Title: Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy
- Authors: Yu Zhu, Chuxiong Sun, Wenfei Yang, Wenqiang Wei, Bo Tang, Tianzhu
Zhang, Zhiyu Li, Shifeng Zhang, Feiyu Xiong, Jie Hu, Mingchuan Yang
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensure Large Language Models (LLMs) align with human values.
We introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs.
Our method achieves a comparable level of alignment with only 1% of the training parameters of other methods.
- Score: 47.327200425168314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach
to ensure Large Language Models (LLMs) align with human values. However,
existing RLHF methods require a high computational cost, one main reason being
that RLHF assigns both the generation and alignment tasks to the LLM
simultaneously. In this paper, we introduce Proxy-RLHF, which decouples the
generation and alignment processes of LLMs, achieving alignment with human
values at a much lower computational cost. We start with a novel Markov
Decision Process (MDP) designed for the alignment process and employ
Reinforcement Learning (RL) to train a streamlined proxy model that oversees
the token generation of the LLM, without altering the LLM itself. Experiments
show that our method achieves a comparable level of alignment with only 1% of
the training parameters of other methods.
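To make the decoupling concrete, below is a minimal, self-contained sketch of the idea rather than the paper's implementation: a frozen stand-in for the LLM proposes tokens, and a small proxy policy decides whether to accept each one, trained with a REINFORCE-style update against a placeholder alignment reward. The hidden-state interface, the accept/reject action space, the reward, and the training rule are all assumptions for illustration; the paper's MDP and RL setup may differ.

```python
# Sketch of a proxy overseeing a frozen LLM's token generation (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16  # toy hidden-state size, standing in for the LLM's hidden dimension


def frozen_llm_step(prefix):
    """Stand-in for the frozen LLM: propose a token id and expose a hidden state."""
    token = int(rng.integers(0, 100))
    hidden = rng.normal(size=HIDDEN)
    return token, hidden


def toy_alignment_reward(prefix):
    """Placeholder for a human-preference reward; not from the paper."""
    return float(np.mean([t % 2 == 0 for t in prefix])) if prefix else 0.0


class ProxyPolicy:
    """Tiny proxy: a logistic gate over the LLM hidden state, far smaller than the LLM."""

    def __init__(self):
        self.w = np.zeros(HIDDEN)

    def accept_prob(self, hidden):
        return 1.0 / (1.0 + np.exp(-self.w @ hidden))

    def reinforce_update(self, hidden, action, reward, lr=0.1):
        # REINFORCE gradient of the Bernoulli accept/reject log-probability.
        p = self.accept_prob(hidden)
        self.w += lr * reward * (action - p) * hidden


proxy = ProxyPolicy()
for episode in range(200):
    prefix, trajectory = [], []
    for _ in range(20):
        token, hidden = frozen_llm_step(prefix)  # the LLM itself is never updated
        action = int(rng.random() < proxy.accept_prob(hidden))
        if action:
            prefix.append(token)
        trajectory.append((hidden, action))
    reward = toy_alignment_reward(prefix)  # episode-level reward for every decision
    for hidden, action in trajectory:
        proxy.reinforce_update(hidden, action, reward)
```

In this toy setup only the proxy's HIDDEN-sized weight vector is trained, which mirrors the paper's claim that alignment can be learned with a small fraction of the parameters used by full RLHF.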
Related papers
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z)
- ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation [12.321332446941378]
Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications.
We propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster.
We introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training.
arXiv Detail & Related papers (2024-06-20T08:04:07Z)
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language.
As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values.
This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO).
By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
- TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning [7.9961739811640244]
The development of Large Language Models often confronts challenges stemming from a heavy reliance on human annotators.
In this work, we pivot to Reinforcement Learning -- but with a twist.
We use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning.
arXiv Detail & Related papers (2024-03-13T16:57:57Z)
- Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z)
- SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)
- Reinforced Self-Training (ReST) for Language Modeling [56.75447441157628]
Reinforcement learning from human feedback (RLHF) can improve the quality of a large language model's (LLM) outputs by aligning them with human preferences.
We propose a simple algorithm for aligning LLMs with human preferences inspired by growing batch reinforcement learning (RL), which we call Reinforced Self-Training (ReST).
Our results show that ReST can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
arXiv Detail & Related papers (2023-08-17T14:12:48Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods; a minimal sketch of the DPO objective appears after this list.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
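As a companion to the Direct Preference Optimization entry above, here is a minimal sketch of the standard DPO loss on a single preference pair: the policy is pushed to widen its implicit reward margin (beta times the log-probability ratio against a frozen reference model) between the chosen and rejected responses. The function and the toy log-probabilities below are illustrative, not taken from the paper.

```python
# DPO loss for one preference pair (x, y_chosen, y_rejected); log-probs are sequence-level.
import math


def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    # Implicit reward of a response: beta * log(pi_theta(y|x) / pi_ref(y|x)).
    margin = beta * ((logp_policy_chosen - logp_ref_chosen)
                     - (logp_policy_rejected - logp_ref_rejected))
    # Negative log-sigmoid of the margin: small or negative margins are penalized.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Toy numbers: the policy already prefers the chosen response slightly.
print(dpo_loss(-12.0, -15.0, -13.0, -14.5))  # ~0.62
```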