Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
- URL: http://arxiv.org/abs/2402.19085v3
- Date: Fri, 11 Oct 2024 08:21:50 GMT
- Title: Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
- Authors: Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun,
- Abstract summary: Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
- Score: 103.12563033438715
- License:
- Abstract: Alignment in artificial intelligence pursues the consistency between model responses and human preferences as well as values. In practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax" -a compromise where enhancements in alignment within one objective (e.g.,harmlessness) can diminish performance in others (e.g.,helpfulness). However, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. To navigate this challenge, we argue the prominence of grounding LLMs with evident preferences. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. Our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving improvements in multi-objective alignment.
Related papers
- MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, textbf MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z) - Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment [57.03947082589616]
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets.
We study this and find that preference data gives a better learning signal when the underlying responses are contrastive.
We introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs.
Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%.
arXiv Detail & Related papers (2024-08-12T16:24:51Z) - Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts [38.95012734839997]
Multi-objective alignment aims at balancing and controlling the different alignment objectives of large language models.
We propose MCA (Multi-objective Contrastive Alignemnt), which constructs an expert prompt and an adversarial prompt for each objective to contrast.
arXiv Detail & Related papers (2024-08-09T14:36:42Z) - Hybrid Alignment Training for Large Language Models [60.46220684809339]
Alignment training is crucial for enabling large language models to cater to human intentions and preferences.
We propose a Hybrid Alignment Training (Hbat) approach, based on alternating alignment and modified elastic weight consolidation methods.
Experimental results show that the proposed textscHbat can significantly outperform all baselines.
arXiv Detail & Related papers (2024-06-21T14:23:57Z) - Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z) - Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment [46.44464839353993]
We introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context.
RiC only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time.
arXiv Detail & Related papers (2024-02-15T18:58:31Z) - From Instructions to Intrinsic Human Values -- A Survey of Alignment
Goals for Big Models [48.326660953180145]
We conduct a survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal.
Our analysis reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs.
arXiv Detail & Related papers (2023-08-23T09:11:13Z) - An Approach to Ordering Objectives and Pareto Efficient Solutions [0.0]
Solutions to multi-objective optimization problems can generally not be compared or ordered.
Decision-makers are often made to believe that scaled objectives can be compared.
We present a method that uses the probability integral transform in order to map the objectives of a problem into scores that all share the same range.
arXiv Detail & Related papers (2022-05-30T17:55:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.