Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct
Preference Optimization
- URL: http://arxiv.org/abs/2310.03708v3
- Date: Fri, 15 Dec 2023 09:58:18 GMT
- Title: Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct
Preference Optimization
- Authors: Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue,
Wanli Ouyang, Yu Qiao
- Abstract summary: We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives with minimal overheads.
MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings.
While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient.
- Score: 78.50294936259026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A single language model (LM), despite aligning well with an average labeler
through reinforcement learning from human feedback (RLHF), may not universally
suit diverse human preferences. Recent approaches therefore opt for
customization by collecting multi-dimensional feedback and creating distinct
reward models (RMs) for each dimension (e.g., helpfulness, harmlessness, or
honesty). Different LMs can then be optimized for different preferences using
multi-objective RLHF (MORLHF) with different reward weightings. Yet, RL
fine-tuning is unstable and resource-heavy, especially for MORLHF with diverse
and usually conflicting objectives. In this paper, we present Multi-Objective
Direct Preference Optimization (MODPO), an RL-free algorithm that extends
Direct Preference Optimization (DPO) for multiple alignment objectives with
minimal overheads. Essentially, MODPO folds language modeling directly into
reward modeling, training LMs as implicit collective reward models (cRMs) that
combine all objectives with specific weightings. While theoretically guaranteed
to produce the same optimal solutions as MORLHF, MODPO is practically more
stable and computationally efficient. Empirical results from safety alignment
and long-form question answering confirm that MODPO matches or outperforms
existing methods, consistently producing a Pareto front of LMs that cater to
diverse preferences with 3 times less computational resources compared to
MORLHF.
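The abstract does not spell out MODPO's training objective, so below is a minimal, hedged PyTorch sketch of one way a margin-augmented, DPO-style loss of this flavor could be written: preference pairs supervise one objective directly, while frozen reward models for the remaining objectives enter as a weighted margin. The function name, argument layout, and exact placement of the weightings are illustrative assumptions, not the paper's reference implementation.
```python
import torch
import torch.nn.functional as F

def modpo_style_loss(
    policy_logps_chosen,     # log pi_theta(y_w | x), summed over tokens, shape (B,)
    policy_logps_rejected,   # log pi_theta(y_l | x), shape (B,)
    ref_logps_chosen,        # log pi_ref(y_w | x) from the frozen reference model
    ref_logps_rejected,      # log pi_ref(y_l | x)
    margin_rewards_chosen,   # frozen RM scores for the other objectives, shape (B, K-1)
    margin_rewards_rejected, # same for the rejected responses, shape (B, K-1)
    w_k=0.5,                 # weight of the objective the preference data encodes (assumed)
    w_rest=None,             # weights of the remaining objectives, shape (K-1,)
    beta=0.1,                # KL-control strength, as in DPO
):
    # Implicit reward of the policy relative to the reference (the DPO reparameterization).
    pi_logratio = policy_logps_chosen - policy_logps_rejected
    ref_logratio = ref_logps_chosen - ref_logps_rejected
    implicit_margin = (beta / w_k) * (pi_logratio - ref_logratio)

    # Margin contributed by the frozen reward models of the other objectives.
    if w_rest is None:
        w_rest = torch.full((margin_rewards_chosen.shape[-1],), 1.0 - w_k)
    rm_margin = (margin_rewards_chosen - margin_rewards_rejected) @ w_rest / w_k

    # Bradley-Terry objective on the margin-corrected implicit rewards.
    return -F.logsigmoid(implicit_margin - rm_margin).mean()
```
Sweeping the weightings (here w_k and w_rest) over a grid is what would trace out the Pareto front of LMs mentioned in the abstract; each setting yields one fine-tuned model.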
Related papers
- Decoding-Time Language Model Alignment with Multiple Objectives [88.64776769490732]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions (a minimal sketch follows this entry).
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
arXiv Detail & Related papers (2024-06-27T02:46:30Z)
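As a rough illustration of the decoding-time combination described in the MOD entry above, here is a minimal sketch that mixes the next-token log-probabilities of several models with user-chosen weights. Whether MOD combines logits, probabilities, or implicit rewards, and with which exact weights, follows the paper's derivation rather than this simplification; the HuggingFace-style model API is an assumption.
```python
import torch

@torch.no_grad()
def combined_decode_step(models, input_ids, weights):
    """One greedy decoding step from a weighted mixture of model predictions.

    models:    list of causal LMs sharing a tokenizer (HuggingFace-style `.logits` assumed)
    input_ids: (1, T) prompt token ids
    weights:   one float per model / objective
    """
    mixed_logprobs = None
    for model, w in zip(models, weights):
        next_token_logits = model(input_ids).logits[:, -1, :]   # predictions for the next token
        logprobs = torch.log_softmax(next_token_logits, dim=-1)
        mixed_logprobs = w * logprobs if mixed_logprobs is None else mixed_logprobs + w * logprobs
    return mixed_logprobs.argmax(dim=-1, keepdim=True)          # shape (1, 1): next token id
```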
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [90.4820014819937]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives [0.5120567378386615]
We propose a hybrid approach to aligning large language models (LLMs): with a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards.
The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives.
arXiv Detail & Related papers (2024-05-28T08:35:48Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling [34.32744849352087]
We propose a method that sequentially fine-tunes large language models to align with human preferences.
We theoretically derive the closed-form optimal SPO policy and loss function.
Empirical results on LLMs of different sizes and multiple evaluation datasets demonstrate that SPO successfully aligns LLMs across multiple dimensions of human preferences.
arXiv Detail & Related papers (2024-05-21T12:47:17Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards [32.799198549439716]
We introduce the Directional Preference Alignment (DPA) framework for aligning large language models (LLMs) with diverse user preferences.
Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles.
Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity.
arXiv Detail & Related papers (2024-02-28T18:58:25Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [126.78737228677025]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
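To make the "implicit reward" idea in the last entry concrete, here is a minimal sketch of the standard DPO loss; variable names are illustrative, and sequence log-probabilities are assumed to be pre-computed (summed over tokens) for chosen and rejected responses.
```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta times the policy's log-ratio against the frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry loss: push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```
MODPO (the main paper above) can be read as this same loss with an additional margin term from frozen reward models for the other objectives, which is what the earlier sketch illustrates.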