Arithmetic Control of LLMs for Diverse User Preferences: Directional
Preference Alignment with Multi-Objective Rewards
- URL: http://arxiv.org/abs/2402.18571v3
- Date: Wed, 6 Mar 2024 08:07:02 GMT
- Title: Arithmetic Control of LLMs for Diverse User Preferences: Directional
Preference Alignment with Multi-Objective Rewards
- Authors: Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu,
Han Zhao, Tong Zhang
- Abstract summary: We introduce the Directional Preference Alignment (DPA) framework for aligning large language models (LLMs).
Unlike scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles.
Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity.
- Score: 32.799198549439716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained control over large language models (LLMs) remains a significant
challenge, hindering their adaptability to diverse user needs. While
Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning
LLMs, its reliance on scalar rewards often limits its ability to capture
diverse user preferences in real-world applications. To address this
limitation, we introduce the Directional Preference Alignment (DPA) framework.
Unlike scalar-reward RLHF, DPA incorporates multi-objective reward modeling
to represent diverse preference profiles. Additionally, DPA models user
preferences as directions (i.e., unit vectors) in the reward space to achieve
user-dependent preference control. Our method involves training a
multi-objective reward model and then fine-tuning the LLM with a
preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF
method adopted by Llama 2. This method enjoys a better performance trade-off
across various reward objectives. In comparison with scalar-reward RLHF,
DPA offers users intuitive control over LLM generation: they can arithmetically
specify their desired trade-offs (e.g., more helpfulness with less verbosity).
We also validate the effectiveness of DPA with real-world alignment experiments
on Mistral-7B. Our method provides straightforward arithmetic control over the
trade-off between helpfulness and verbosity while maintaining competitive
performance with strong baselines such as Direct Preference Optimization (DPO).
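The mechanism described in the abstract lends itself to a compact illustration. The sketch below is a minimal, hypothetical rendering (not the authors' code): a multi-objective reward model returns a vector of rewards, here (helpfulness, verbosity); a user preference is a unit vector in that space; and preference-conditioned rejection sampling keeps the response with the highest reward along that direction. The names `generate` and `reward_model`, the two-objective setup, and the angle values are assumptions for illustration.

```python
import math
from typing import Callable, List, Tuple

def preference_direction(theta_deg: float) -> Tuple[float, float]:
    """A user preference as a unit vector in (helpfulness, verbosity) reward space."""
    theta = math.radians(theta_deg)
    return (math.cos(theta), math.sin(theta))

def directional_reward(reward_vec: Tuple[float, float], v: Tuple[float, float]) -> float:
    """Scalar reward for one response: dot product of the multi-objective
    reward vector with the user-chosen direction v."""
    return reward_vec[0] * v[0] + reward_vec[1] * v[1]

def rejection_sample(
    prompt: str,
    generate: Callable[[str, int], List[str]],                # hypothetical sampler: k responses per prompt
    reward_model: Callable[[str, str], Tuple[float, float]],  # hypothetical multi-objective reward model
    v: Tuple[float, float],
    k: int = 8,
) -> str:
    """Preference-conditioned rejection sampling: keep the candidate whose
    reward vector scores highest along direction v; the winning (prompt,
    response) pairs would then be used for supervised finetuning."""
    candidates = generate(prompt, k)
    return max(candidates, key=lambda y: directional_reward(reward_model(prompt, y), v))

# Arithmetic control: a user who wants more helpfulness and less verbosity
# chooses a direction closer to the helpfulness axis, e.g. 15 degrees
# instead of 45 degrees.
v_user = preference_direction(15.0)
```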
Related papers
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, the Self-Exploring Language Models (SELM) method significantly boosts performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives [0.5120567378386615]
We propose a hybrid approach to aligning large language models (LLMs).
With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards.
The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives.
arXiv Detail & Related papers (2024-05-28T08:35:48Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference datasets, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [7.676477609461592]
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent.
DPO relies on contrastive responses generated by a human annotator and an alternative LLM, rather than by the policy model.
In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO.
Our proposed method effectively fine-tunes LLMs in limited-resource environments, leading to improved alignment with user intent.
arXiv Detail & Related papers (2024-02-15T16:00:58Z)
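As a rough illustration of the rejection-sampling/DPO combination summarized in the entry above (a sketch, not necessarily the paper's exact recipe): draw several responses from the SFT policy, score them with a point-wise reward model, and keep response pairs whose reward gap exceeds a threshold as (chosen, rejected) data for DPO. The helper names `generate` and `reward_model` and the threshold value are assumptions.

```python
from itertools import combinations
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompt: str,
    generate: Callable[[str, int], List[str]],   # hypothetical: k samples from the SFT policy
    reward_model: Callable[[str, str], float],   # hypothetical point-wise reward model
    k: int = 8,
    gap: float = 0.5,                            # assumed reward-gap threshold
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for DPO training: any pair of
    sampled responses whose reward difference exceeds the threshold."""
    scored = [(y, reward_model(prompt, y)) for y in generate(prompt, k)]
    pairs = []
    for (y1, r1), (y2, r2) in combinations(scored, 2):
        if abs(r1 - r2) >= gap:
            chosen, rejected = (y1, y2) if r1 > r2 else (y2, y1)
            pairs.append((prompt, chosen, rejected))
    return pairs
```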
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization [76.09576643028362]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives.
MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models.
It theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z)
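One heavily hedged reading of "folding language modeling into reward modeling" in the entry above is a margin-adjusted DPO loss: the policy's log-ratio against the reference model serves as the implicit reward for one objective, while (pre-weighted) rewards from the remaining objectives enter the same sigmoid as an offset. The sketch below follows that reading; MODPO's exact weighting and normalization may differ, and the parameter names and defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def margin_adjusted_dpo_loss(
    policy_logps_chosen: torch.Tensor,    # log pi(y_w | x), summed over tokens
    policy_logps_rejected: torch.Tensor,  # log pi(y_l | x)
    ref_logps_chosen: torch.Tensor,       # log pi_ref(y_w | x)
    ref_logps_rejected: torch.Tensor,     # log pi_ref(y_l | x)
    margin_chosen: torch.Tensor,          # pre-weighted rewards of the other objectives for y_w
    margin_rejected: torch.Tensor,        # pre-weighted rewards of the other objectives for y_l
    w_k: float = 0.5,                     # assumed weight of the objective modeled implicitly by the LM
    beta: float = 0.1,
) -> torch.Tensor:
    """Margin-adjusted DPO loss in the spirit of MODPO: rewards of the
    remaining objectives act as a margin inside the sigmoid, scaled by the
    preference weights."""
    implicit = (policy_logps_chosen - ref_logps_chosen) - (policy_logps_rejected - ref_logps_rejected)
    logits = (beta / w_k) * implicit - (1.0 / w_k) * (margin_chosen - margin_rejected)
    return -F.logsigmoid(logits).mean()
```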
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
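Since several entries in this list build on DPO, a minimal sketch of its standard objective may be useful (assuming per-sequence log-probabilities already summed over tokens): the loss increases the policy's log-ratio on the chosen response relative to the rejected one, with `beta` controlling deviation from the reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logps_chosen: torch.Tensor,    # log pi(y_w | x)
    policy_logps_rejected: torch.Tensor,  # log pi(y_l | x)
    ref_logps_chosen: torch.Tensor,       # log pi_ref(y_w | x)
    ref_logps_rejected: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO objective: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_logps_chosen - ref_logps_chosen
    rejected_ratio = policy_logps_rejected - ref_logps_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```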
- Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMControl and Meta-world.
It shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z)