Fine-tuning Language Models with Generative Adversarial Reward Modelling
- URL: http://arxiv.org/abs/2305.06176v3
- Date: Tue, 5 Mar 2024 05:18:54 GMT
- Title: Fine-tuning Language Models with Generative Adversarial Reward Modelling
- Authors: Zhang Ze Yu, Lau Jia Jaw, Zhang Hui, Bryan Kian Hsiang Low
- Abstract summary: Reinforcement Learning with Human Feedback (RLHF) has been demonstrated to significantly enhance the performance of large language models (LLMs)
We propose another alternative approach: Reinforcement Learning with Generative Adversarial Feedback (RLGAF) to RLHF and SFT.
- Score: 30.424363135421917
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reinforcement Learning with Human Feedback (RLHF) has been demonstrated to
significantly enhance the performance of large language models (LLMs) by
aligning their outputs with desired human values through instruction tuning.
However, RLHF is constrained by the expertise and productivity limitations of
human evaluators. A response to this downside is to fall back to supervised
fine-tuning (SFT) with additional carefully selected expert demonstrations.
However, while this method has been proven to be effective, it invariably also
leads to increased human-in-the-loop overhead. In this study, we propose
another alternative approach: Reinforcement Learning with Generative
Adversarial Feedback (RLGAF) to RLHF and SFT, which uses a generative
adversarial training style to enable the LLMs to learn useful human expert
demonstrations without being directly exposed to the training examples, thus
enabling good generalization capabilities while preserving sample efficiency.
Our preliminary findings indicate that RLGAF can help align LLMs outputs with
competitive performance against RLHF and SFT, while not suffering from their
respective inherent restrictions, suggesting promising avenues for further
research on automating AI alignment.
Related papers
- Prototypical Reward Network for Data-Efficient RLHF [17.220998116937444]
A reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs)
Our proposed framework Proto-RM leverages prototypical networks to enhance reward models under limited human feedback.
arXiv Detail & Related papers (2024-06-06T15:23:30Z) - Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment [103.05005690990271]
Traditional alignment strategies rely heavily on human intervention, such asSupervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)
We propose a novel self-alignment method that utilizes a Chain of Thought (CoT) approach, termed AlignCoT.
We introduce the Mixture of insighTful Experts (MoTE) architecture, which applies mixture of experts to enhance each component of the AlignCoT process, markedly increasing alignment efficiency.
arXiv Detail & Related papers (2024-05-01T15:06:05Z) - RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs [49.386699863989335]
Training large language models (LLMs) to serve as effective assistants for humans requires careful consideration.
A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences.
In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals.
arXiv Detail & Related papers (2024-04-12T15:54:15Z) - CLHA: A Simple yet Effective Contrastive Learning Framework for Human Alignment [42.71324708567498]
Reinforcement learning from human feedback (RLHF) is a crucial technique in aligning large language models (LLMs) with human preferences.
We present a simple yet effective Contrastive Learning Framework for Human Alignment (CLHA) to align LLMs with human preferences directly.
arXiv Detail & Related papers (2024-03-25T11:37:15Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - InferAligner: Inference-Time Alignment for Harmlessness through
Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z) - Aligning Large Language Models with Human Preferences through Representation Engineering [41.81020951061438]
Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM.
This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement.
arXiv Detail & Related papers (2023-12-26T11:01:36Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - RLAIF: Scaling Reinforcement Learning from Human Feedback with AI
Feedback [5.469395454378616]
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences.
RL from AI Feedback (RLAIF) offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators.
Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF.
arXiv Detail & Related papers (2023-09-01T05:53:33Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward
Model [126.78737228677025]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.