Arithmetic Control of LLMs for Diverse User Preferences: Directional
Preference Alignment with Multi-Objective Rewards
- URL: http://arxiv.org/abs/2402.18571v3
- Date: Wed, 6 Mar 2024 08:07:02 GMT
- Title: Arithmetic Control of LLMs for Diverse User Preferences: Directional
Preference Alignment with Multi-Objective Rewards
- Authors: Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu,
Han Zhao, Tong Zhang
- Abstract summary: We introduce the Directional Preference Alignment (DPA) framework for aligning large language models (LLMs)
Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles.
Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity.
- Score: 32.799198549439716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained control over large language models (LLMs) remains a significant
challenge, hindering their adaptability to diverse user needs. While
Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning
LLMs, its reliance on scalar rewards often limits its ability to capture
diverse user preferences in real-world applications. To address this
limitation, we introduce the Directional Preference Alignment (DPA) framework.
Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling
to represent diverse preference profiles. Additionally, DPA models user
preferences as directions (i.e., unit vectors) in the reward space to achieve
user-dependent preference control. Our method involves training a
multi-objective reward model and then fine-tuning the LLM with a
preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF
method adopted by Llama 2. This method enjoys a better performance trade-off
across various reward objectives. In comparison with the scalar-reward RLHF,
DPA offers users intuitive control over LLM generation: they can arithmetically
specify their desired trade-offs (e.g., more helpfulness with less verbosity).
We also validate the effectiveness of DPA with real-world alignment experiments
on Mistral-7B. Our method provides straightforward arithmetic control over the
trade-off between helpfulness and verbosity while maintaining competitive
performance with strong baselines such as Direct Preference Optimization (DPO).
Related papers
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [90.4820014819937]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives [0.5120567378386615]
We propose a hybrid approach to aligning large language models (LLMs)
With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards.
The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives.
arXiv Detail & Related papers (2024-05-28T08:35:48Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [7.676477609461592]
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent.
DPO relies on contrastive responses generated from human annotator and alternative LLM, instead of the policy model.
In this paper, we address both challenges by systematically combining sampling rejection (RS) and DPO.
Our proposed method effectively fine-tunes LLMs with limited resource environments, leading to improved alignment with user intent.
arXiv Detail & Related papers (2024-02-15T16:00:58Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
We investigate the impact of diversified preferences on reward modeling.
We find that diversified preference data negatively affect the calibration performance of reward models.
We propose a novel Multi-Objective Reward learning method to enhance the calibration performance of RMs on shared preferences.
arXiv Detail & Related papers (2023-12-12T16:17:15Z) - Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct
Preference Optimization [78.50294936259026]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives with minimal overheads.
MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings.
While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward
Model [126.78737228677025]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.