Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct
Preference Optimization
- URL: http://arxiv.org/abs/2310.03708v3
- Date: Fri, 15 Dec 2023 09:58:18 GMT
- Title: Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct
Preference Optimization
- Authors: Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue,
Wanli Ouyang, Yu Qiao
- Abstract summary: We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives with minimal overheads.
MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings.
While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient.
- Score: 78.50294936259026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A single language model (LM), despite aligning well with an average labeler
through reinforcement learning from human feedback (RLHF), may not universally
suit diverse human preferences. Recent approaches therefore opt for
customization by collecting multi-dimensional feedback and creating distinct
reward models (RMs) for each dimension (e.g., helpfulness, harmlessness, or
honesty). Different LMs can then be optimized for different preferences using
multi-objective RLHF (MORLHF) with different reward weightings. Yet, RL
fine-tuning is unstable and resource-heavy, especially for MORLHF with diverse
and usually conflicting objectives. In this paper, we present Multi-Objective
Direct Preference Optimization (MODPO), an RL-free algorithm that extends
Direct Preference Optimization (DPO) for multiple alignment objectives with
minimal overheads. Essentially, MODPO folds language modeling directly into
reward modeling, training LMs as implicit collective reward models (cRMs) that
combine all objectives with specific weightings. While theoretically guaranteed
to produce the same optimal solutions as MORLHF, MODPO is practically more
stable and computationally efficient. Empirical results from safety alignment
and long-form question answering confirm that MODPO matches or outperforms
existing methods, consistently producing a Pareto front of LMs that cater to
diverse preferences with 3 times less computational resources compared to
MORLHF.
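For intuition about how MODPO folds the extra objectives into a DPO-style loss, here is a minimal PyTorch sketch of a margin-adjusted, weighted preference loss in the spirit of the paper; the variable names, the placement of the weights, and the default beta are illustrative assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def modpo_style_loss(policy_logps_w, policy_logps_l,   # log pi_theta(y|x), chosen / rejected
                         ref_logps_w, ref_logps_l,         # log pi_ref(y|x), chosen / rejected
                         aux_rewards_w, aux_rewards_l,     # (batch, k-1) scores from the other objectives' RMs
                         w_main, w_aux, beta=0.1):
        # Implicit reward of the policy under the usual DPO reparameterization.
        implicit_w = beta * (policy_logps_w - ref_logps_w)
        implicit_l = beta * (policy_logps_l - ref_logps_l)
        # Margin contributed by the remaining objectives, mixed with their weights.
        margin = (aux_rewards_w - aux_rewards_l) @ w_aux    # (batch,)
        # Rescale so the implicit collective reward reflects the chosen weighting.
        logits = (implicit_w - implicit_l - margin) / w_main
        return -F.logsigmoid(logits).mean()

Sweeping the weighting and retraining is what yields the Pareto front of LMs mentioned in the abstract.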
Related papers
- Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models [15.799929216215672]
We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization.
Unlike prior work, MOPO operates directly on pairwise preference data, requires no point-wise reward assumption, and avoids prompt-context engineering.
arXiv Detail & Related papers (2025-05-16T05:58:26Z)
- Robust Multi-Objective Preference Alignment with Online DPO [6.434799451791957]
Multi-objective preference alignment is critical for developing AI systems that are personalizable, helpful, and safe.
Existing approaches are either computationally expensive to train or do not sufficiently steer model behaviors.
This paper introduces the Multi-Objective Online DPO algorithm, designed to robustly and efficiently align model behaviors with multiple, potentially conflicting human preferences.
arXiv Detail & Related papers (2025-03-01T02:01:49Z)
- MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment [14.541973333460149]
Mixing Preference Optimization (MPO) is a post-processing framework for aggregating single-objective policies.
MPO achieves balanced performance across diverse preferences, outperforming existing models with significantly reduced computational costs.
arXiv Detail & Related papers (2025-02-25T23:22:12Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences.
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from large text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z)
- Decoding-Time Language Model Alignment with Multiple Objectives [116.42095026960598]
Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives.
Here, we propose multi-objective decoding (MOD), a decoding-time algorithm that outputs the next token from a linear combination of predictions.
We show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method.
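As a rough illustration of the decoding-time idea in this MOD entry, the sketch below mixes the next-token predictions of several objective-specific LMs with user-chosen weights at each step; it assumes Hugging Face-style causal LMs, mixes log-probabilities linearly, and decodes greedily, which are simplifications rather than the paper's actual combination rule.

    import torch

    @torch.no_grad()
    def mod_style_next_token(models, input_ids, weights):
        # Combine per-model next-token log-probabilities with scalar weights.
        logps = []
        for model in models:
            logits = model(input_ids).logits[:, -1, :]      # Hugging Face-style causal LM assumed
            logps.append(torch.log_softmax(logits, dim=-1))
        combined = sum(w * lp for w, lp in zip(weights, logps))
        return torch.argmax(combined, dim=-1)               # greedy choice of the next token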
arXiv Detail & Related papers (2024-06-27T02:46:30Z)
- mDPO: Conditional Preference Optimization for Multimodal Large Language Models [52.607764280030196]
Direct preference optimization (DPO) has been shown to be an effective method for large language model (LLM) alignment.
Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement.
We propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference.
arXiv Detail & Related papers (2024-06-17T17:59:58Z)
- Multi-objective Reinforcement learning from AI Feedback [0.0]
This paper presents MORLAIF, a novel approach to improving the alignment and performance of language models trained using reinforcement learning from AI feedback (RLAIF).
In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into simpler principles, such as toxicity, factuality, and sycophancy.
Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger language models using smaller ones.
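A minimal sketch of the per-principle decomposition above, assuming each principle has its own preference model and that the per-principle scores are combined by a simple weighted sum (the combination rule and names below are illustrative assumptions):

    def collective_reward(principle_scores, weights):
        # principle_scores: dict mapping a principle (e.g. "toxicity", "factuality",
        # "sycophancy") to the score from its dedicated preference model.
        # A weighted sum is one simple way to scalarize; it is an assumption here.
        return sum(weights[p] * s for p, s in principle_scores.items())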
arXiv Detail & Related papers (2024-06-11T14:24:00Z)
- Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives [0.5120567378386615]
We propose a hybrid approach to aligning large language models (LLMs).
With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards.
The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives.
arXiv Detail & Related papers (2024-05-28T08:35:48Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging [148.77027765872006]
We study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem.
LLMs are aligned to multiple preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem.
We show that we can achieve personalized alignment by decomposing preferences into multiple dimensions.
arXiv Detail & Related papers (2023-10-17T20:22:13Z)
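As a rough illustration of the post-hoc parameter merging described in the entry above (not the paper's exact recipe), the sketch below takes a weighted average of the parameters of several preference-specific fine-tunes that share one architecture:

    def merge_parameters(state_dicts, weights):
        # Weighted average of matching tensors (e.g. PyTorch state_dicts) from
        # several fine-tuned checkpoints. Treating every tensor identically is
        # a simplifying assumption made for illustration.
        merged = {}
        for name in state_dicts[0]:
            merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
        return merged

Loading the merged state dict into the shared architecture then gives a model whose behavior interpolates between the individual preference dimensions.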
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.