Controllable Preference Optimization: Toward Controllable
Multi-Objective Alignment
- URL: http://arxiv.org/abs/2402.19085v1
- Date: Thu, 29 Feb 2024 12:12:30 GMT
- Title: Controllable Preference Optimization: Toward Controllable
Multi-Objective Alignment
- Authors: Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen,
Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
- Abstract summary: Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
- Score: 107.63756895544842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment in artificial intelligence pursues the consistency between model
responses and human preferences as well as values. In practice, the
multifaceted nature of human preferences inadvertently introduces what is known
as the "alignment tax": a compromise where improving alignment on one
objective (e.g., harmlessness) can degrade performance on others
(e.g., helpfulness). However, existing alignment techniques are mostly
unidirectional, leading to suboptimal trade-offs and poor flexibility across
objectives. To navigate this challenge, we argue for the importance of
grounding LLMs in explicit preferences. We introduce controllable preference
optimization (CPO), which explicitly specifies preference scores for different
objectives, thereby guiding the model to generate responses that meet the
requirements. Our experimental analysis reveals that the aligned models can
provide responses that match various preferences among the "3H" (helpfulness,
honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and
alignment goals, we surpass baseline methods in aligning with single
objectives, hence mitigating the impact of the alignment tax and achieving
Pareto improvements in multi-objective alignment.
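The abstract does not give CPO's exact conditioning format, so the following is only a minimal sketch of the general idea: explicit preference scores for the 3H objectives are rendered as control text and prepended to the prompt so that generation can condition on them. The score scale, control-token format, and the placeholder model are illustrative assumptions, not the paper's implementation.
```python
# Minimal sketch (not the paper's implementation): condition generation on
# explicit "3H" preference scores by prepending them as control text.
# Score scale (1-5), token format, and model name are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_controlled_prompt(user_prompt: str,
                            helpfulness: int,
                            honesty: int,
                            harmlessness: int) -> str:
    """Prepend explicit preference scores so the model can condition on them."""
    control = (f"<helpfulness:{helpfulness}> "
               f"<honesty:{honesty}> "
               f"<harmlessness:{harmlessness}>")
    return f"{control}\n{user_prompt}"

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated continuation, not the prompt tokens.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder base model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    prompt = build_controlled_prompt(
        "Explain how to pick a strong password.",
        helpfulness=5, honesty=5, harmlessness=5)
    print(generate(model, tokenizer, prompt))
```
In a CPO-style setup the model would be trained on data annotated with such scores so that changing them at inference time shifts the trade-off among the objectives; here the scores only illustrate the conditioning interface.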
Related papers
- Pareto-Optimal Learning from Preferences with Hidden Context [18.340302968130683]
We propose POPL, which frames discrepant group preferences as objectives with potential trade-offs.
Our empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions.
POPL can serve as a foundation for techniques optimizing specific notions of group fairness.
arXiv Detail & Related papers (2024-06-21T18:57:38Z) - Hybrid Alignment Training for Large Language Models [60.46220684809339]
Alignment training is crucial for enabling large language models to cater to human intentions and preferences.
We propose a Hybrid Alignment Training (Hbat) approach, based on alternating alignment and modified elastic weight consolidation methods.
Experimental results show that the proposed Hbat can significantly outperform all baselines.
arXiv Detail & Related papers (2024-06-21T14:23:57Z) - Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z) - Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment [46.44464839353993]
We introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context.
RiC only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time.
arXiv Detail & Related papers (2024-02-15T18:58:31Z) - Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in a single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z) - Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct
Preference Optimization [78.50294936259026]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives with minimal overheads.
MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings.
While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z) - From Instructions to Intrinsic Human Values -- A Survey of Alignment
Goals for Big Models [48.326660953180145]
We conduct a survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal.
Our analysis reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs.
arXiv Detail & Related papers (2023-08-23T09:11:13Z) - An Approach to Ordering Objectives and Pareto Efficient Solutions [0.0]
Solutions to multi-objective optimization problems can generally not be compared or ordered.
Decision-makers are often made to believe that scaled objectives can be compared.
We present a method that uses the probability integral transform in order to map the objectives of a problem into scores that all share the same range.
arXiv Detail & Related papers (2022-05-30T17:55:53Z)
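The last entry's probability-integral-transform idea can be made concrete: replace each raw objective value with its empirical-CDF score so that every objective lands in the same (0, 1] range and solutions become comparable. A minimal sketch under the assumption of a finite set of candidate solutions (the rank-based CDF estimate below is an illustrative choice, not the paper's exact procedure):
```python
# Minimal sketch of the probability-integral-transform idea from the last
# entry: map each objective's raw values through its empirical CDF so that
# all objectives share the same (0, 1] range and become comparable.
import numpy as np

def pit_scores(objective_values: np.ndarray) -> np.ndarray:
    """objective_values: shape (n_solutions, n_objectives), larger = better.

    Returns empirical-CDF scores in (0, 1] with the same shape.
    """
    n, _ = objective_values.shape
    # Rank each column independently (ties broken by position), then convert
    # ranks to empirical-CDF values in (0, 1].
    ranks = objective_values.argsort(axis=0).argsort(axis=0) + 1
    return ranks / n

if __name__ == "__main__":
    # Two objectives on very different scales (e.g., a score vs. a latency gain).
    vals = np.array([[0.91, 12.0],
                     [0.88, 45.0],
                     [0.95,  3.0]])
    print(pit_scores(vals))
```
After this transform, all objectives are expressed on the same scale, which is what allows the kind of ordering and comparison the entry describes.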