Related papers: Aligning Language Models with Offline Learning from Human Feedback

Aligning Language Models with Offline Learning from Human Feedback

URL: http://arxiv.org/abs/2308.12050v2
Date: Sun, 10 Dec 2023 03:27:10 GMT
Title: Aligning Language Models with Offline Learning from Human Feedback
Authors: Jian Hu, Li Tao, June Yang, Chandler Zhou
Abstract summary: We propose an offline learning from human feedback framework to align language models without interacting with environments. Specifically, we explore filtering alignment (FA), reward-weighted regression (RWR), and conditional alignment (CA) to align language models to human preferences.
Score: 5.539080592071948
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning from human preferences is crucial for language models (LMs) to effectively cater to human needs and societal values. Previous research has made notable progress by leveraging human feedback to follow instructions. However, these approaches rely primarily on online learning techniques like Proximal Policy Optimization (PPO), which have been proven unstable and challenging to tune for language models. Moreover, PPO requires complex distributed system implementation, hindering the efficiency of large-scale distributed training. In this study, we propose an offline learning from human feedback framework to align LMs without interacting with environments. Specifically, we explore filtering alignment (FA), reward-weighted regression (RWR), and conditional alignment (CA) to align language models to human preferences. By employing a loss function similar to supervised fine-tuning, our methods ensure more stable model training than PPO with a simple machine learning system~(MLSys) and much fewer (around 9\%) computing resources. Experimental results demonstrate that conditional alignment outperforms other offline alignment methods and is comparable to PPO.

Related papers

Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO) XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language. As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values. This paper studies two main approaches to LLM alignment: Reinforcement Learning with Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO) By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
Is Crowdsourcing Breaking Your Bank? Cost-Effective Fine-Tuning of Pre-trained Language Models with Proximal Policy Optimization [18.75866961339424]
ChatGPT has highlighted the potential of reinforcement learning from human feedback. To reduce labor costs, we propose a self-supervised text ranking approach.
arXiv Detail & Related papers (2024-02-28T12:24:07Z)
Active Preference Learning for Large Language Models [12.093302163058436]
We develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
arXiv Detail & Related papers (2024-02-12T23:09:00Z)
Direct Language Model Alignment from Online AI Feedback [78.40436231613754]
Direct alignment from preferences (DAP) methods have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF) In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF) uses an LLM as annotator: on each training, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback.
arXiv Detail & Related papers (2024-02-07T12:31:13Z)
Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback. We term this approach Nash learning from human feedback (NLHF) We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.