OPTune: Efficient Online Preference Tuning
- URL: http://arxiv.org/abs/2406.07657v1
- Date: Tue, 11 Jun 2024 18:55:04 GMT
- Title: OPTune: Efficient Online Preference Tuning
- Authors: Lichang Chen, Jiuhai Chen, Chenxi Liu, John Kirchenbauer, Davit Soselia, Chen Zhu, Tom Goldstein, Tianyi Zhou, Heng Huang,
- Abstract summary: We propose a more efficient data exploration strategy for online preference tuning (OPTune)
OPTune dynamically samples informative responses for on-policy preference alignment.
In our evaluations, OPTune'd LLMs enjoy 1.27-1.56x faster training speed due to the efficient data exploration strategy.
- Score: 107.44836901099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with human feedback~(RLHF) is critical for aligning Large Language Models (LLMs) with human preference. Compared to the widely studied offline version of RLHF, \emph{e.g.} direct preference optimization (DPO), recent works have shown that the online variants achieve even better alignment. However, online alignment requires on-the-fly generation of new training data, which is costly, hard to parallelize, and suffers from varying quality and utility. In this paper, we propose a more efficient data exploration strategy for online preference tuning (OPTune), which does not rely on human-curated or pre-collected teacher responses but dynamically samples informative responses for on-policy preference alignment. During data generation, OPTune only selects prompts whose (re)generated responses can potentially provide more informative and higher-quality training signals than the existing responses. In the training objective, OPTune reweights each generated response (pair) by its utility in improving the alignment so that learning can be focused on the most helpful samples. Throughout our evaluations, OPTune'd LLMs maintain the instruction-following benefits provided by standard preference tuning whilst enjoying 1.27-1.56x faster training speed due to the efficient data exploration strategy.
Related papers
- Optimizing Preference Alignment with Differentiable NDCG Ranking [9.594183083553245]
Recent studies have uncovered a substantial discrepancy between the theoretical aspirations of preference learning and its real-world results.
This paper introduces underlineDirect underlineRanking underlinePreference underlineOptimization (O), a novel method that views human preference alignment as a Learning-to-Rank task.
arXiv Detail & Related papers (2024-10-17T08:54:57Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - REAL: Response Embedding-based Alignment for LLMs [1.9513983244114355]
We propose a strategy for sampling a high-quality training dataset that focuses on acquiring the most informative response pairs.
Experimental results indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs.
Our findings suggest that focusing on less similar pairs can improve the efficiency of LLM alignment, saving up to 65% of annotators' work.
arXiv Detail & Related papers (2024-09-17T22:40:54Z) - Reward Difference Optimization For Sample Reweighting In Offline RLHF [18.62836654699957]
Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others.
We propose a simple yet effective solution called Reward Difference Optimization, shorted as RDO.
Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation.
arXiv Detail & Related papers (2024-08-18T07:04:16Z) - SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences.
Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation.
Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z) - Online Self-Preferring Language Models [34.22412851864247]
Online Self-Preferring (OSP) language models learn from self-generated response pairs and self-judged preference strengths.
OSP achieves state-of-the-art alignment performance across various metrics in two widely used human preference datasets.
arXiv Detail & Related papers (2024-05-23T02:13:34Z) - Direct Language Model Alignment from Online AI Feedback [78.40436231613754]
Direct alignment from preferences (DAP) methods have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF)
In this study, we posit that online feedback is key and improves DAP methods.
Our method, online AI feedback (OAIF) uses an LLM as annotator: on each training, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback.
arXiv Detail & Related papers (2024-02-07T12:31:13Z) - Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.