BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment
- URL: http://arxiv.org/abs/2406.12168v4
- Date: Mon, 21 Oct 2024 18:00:54 GMT
- Title: BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment
- Authors: Wenda Xu, Jiachen Li, William Yang Wang, Lei Li
- Abstract summary: Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets.
We highlight the need to develop specific online DAP algorithms to fully harness the power of online training.
- Score: 64.39433316922148
- License:
- Abstract: Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specifically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a proper trust region for LLM alignment. We conduct extensive experiments to validate the effectiveness and applicability of our approach by integrating it with various DAP methods, resulting in significant performance improvements across a wide range of tasks when training with the same amount of preference data. Even when only introducing one additional data collection phase, our online BPO improves its offline DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on Anthropic Helpfulness in terms of win rate against human reference text.
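The abstract does not spell out the training objective, but one natural reading of the described trust region is a DAP loss (e.g. DPO) whose reference model is the behavior LLM that collected the current batch of preference pairs, rather than a fixed offline reference. The snippet below is a minimal sketch under that assumption; the function name and arguments are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def bpo_style_dpo_loss(policy_logp_chosen, policy_logp_rejected,
                       behavior_logp_chosen, behavior_logp_rejected,
                       beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss anchored to the behavior LLM (illustrative).

    Each *_logp_* argument is a 1-D tensor of summed token log-probabilities
    for a batch of chosen/rejected responses. Taking the reference log-probs
    from the behavior LLM (the model that generated the responses) is one
    reading of BPO's 'trust region around the behavior LLM'.
    """
    # Log-ratio of the learned policy to the behavior LLM for each response.
    chosen_logratio = policy_logp_chosen - behavior_logp_chosen
    rejected_logratio = policy_logp_rejected - behavior_logp_rejected
    # Bradley-Terry style objective on the difference of log-ratios.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Under this reading, the only change from an offline DAP baseline is which model supplies the reference log-probabilities: the behavior LLM is refreshed at each online data-collection phase instead of staying fixed at the initial checkpoint.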
Related papers
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost a model's alignment with human preferences.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: amplifying the reward signal learned during alignment training.
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
- Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model [3.300814846990438]
Large Language Models (LLMs) have become increasingly popular due to their ability to process and generate natural language.
As they are trained on massive datasets of text, LLMs can inherit harmful biases and produce outputs that are not aligned with human values.
This paper studies two main approaches to LLM alignment: Reinforcement Learning from Human Feedback (RLHF) and contrastive learning-based methods like Direct Preference Optimization (DPO).
By analyzing the stability and robustness of RLHF and DPO, we propose MPO, a novel method that mitigates the weaknesses of both approaches.
arXiv Detail & Related papers (2024-03-28T14:15:10Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Learning Goal-Conditioned Policies from Sub-Optimal Offline Data via Metric Learning [22.174803826742963]
We address the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning.
We propose the use of metric learning to approximate the optimal value function for goal-conditioned offline RL problems.
We show that our method estimates optimal behaviors from severely sub-optimal offline datasets without suffering from out-of-distribution estimation errors.
arXiv Detail & Related papers (2024-02-16T16:46:53Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- Direct Language Model Alignment from Online AI Feedback [78.40436231613754]
Direct alignment from preferences (DAP) methods have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF).
In this study, we posit that online feedback is key and improves DAP methods.
Our method, online AI feedback (OAIF), uses an LLM as annotator: at each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback (see the sketch after this list).
arXiv Detail & Related papers (2024-02-07T12:31:13Z)
- Using Offline Data to Speed-up Reinforcement Learning in Procedurally Generated Environments [11.272582555795989]
We study whether agents can leverage offline data in the form of trajectories to improve sample efficiency in procedurally generated environments.
We consider two settings of using imitation learning (IL) from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data.
arXiv Detail & Related papers (2023-04-18T16:23:15Z)
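The OAIF entry above describes its annotation loop only in prose; as referenced there, the sketch below illustrates one plausible shape of such an online feedback step. The interfaces and names (PolicyModel, AnnotatorLLM, dap_update) are hypothetical and not taken from that paper.

```python
from typing import Callable, Protocol

class PolicyModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class AnnotatorLLM(Protocol):
    def prefers_first(self, prompt: str, a: str, b: str) -> bool: ...

def online_feedback_step(policy: PolicyModel,
                         annotator: AnnotatorLLM,
                         prompt: str,
                         dap_update: Callable[[str, str, str], None]) -> None:
    """One OAIF-style step (hypothetical interfaces): sample two responses
    from the current policy, let an annotator LLM pick the preferred one,
    and run a direct-alignment update on the fresh preference pair."""
    response_a = policy.generate(prompt)
    response_b = policy.generate(prompt)
    if annotator.prefers_first(prompt, response_a, response_b):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    # Any DAP objective (DPO, IPO, etc.) could be plugged in here.
    dap_update(prompt, chosen, rejected)
```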