Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
- URL: http://arxiv.org/abs/2402.10207v5
- Date: Wed, 5 Jun 2024 09:25:41 GMT
- Title: Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
- Authors: Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen,
- Abstract summary: We introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context.
RiC only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time.
- Score: 46.44464839353993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless AI systems. However, it is generally costly and unstable to fine-tune large foundation models using reinforcement learning (RL), and the multi-dimensionality, heterogeneity, and conflicting nature of human preferences further complicate the alignment process. In this paper, we introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context and applies supervised fine-tuning for alignment. The salient features of RiC are simplicity and adaptivity, as it only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time. Inspired by the analytical solution of an abstracted convex optimization problem, our dynamic inference-time adjustment method approaches the Pareto-optimal solution for multiple objectives. Empirical evidence demonstrates the efficacy of our method in aligning both Large Language Models (LLMs) and diffusion models to accommodate diverse rewards with only around 10% GPU hours compared with multi-objective RL baseline.
Related papers
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - Expensive Multi-Objective Bayesian Optimization Based on Diffusion Models [17.19004913553654]
Multi-objective Bayesian optimization (MOBO) has shown promising performance on various expensive multi-objective optimization problems (EMOPs)
We propose a novel Composite Diffusion Model based Pareto Set Learning algorithm, namely CDM-PSL, for expensive MOBO.
Our proposed algorithm attains superior performance compared with various state-of-the-art MOBO algorithms.
arXiv Detail & Related papers (2024-05-14T14:55:57Z) - Controllable Preference Optimization: Toward Controllable
Multi-Objective Alignment [107.63756895544842]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z) - Non-orthogonal Age-Optimal Information Dissemination in Vehicular
Networks: A Meta Multi-Objective Reinforcement Learning Approach [0.0]
A roadside unit (RSU) provides timely updates about a set of physical processes to vehicles.
The formulated problem is a multi-objective mixed-integer nonlinear programming problem.
We develop a hybrid deep Q-network (DQN)-deep deterministic policy gradient (DDPG) model to solve each optimization sub-problem.
arXiv Detail & Related papers (2024-02-15T16:51:47Z) - Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct
Preference Optimization [78.50294936259026]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives with minimal overheads.
MODPO folds language modeling directly into reward modeling, training LMs as implicit collective reward models (cRMs) that combine all objectives with specific weightings.
While theoretically guaranteed to produce the same optimal solutions as MORLHF, MODPO is practically more stable and computationally efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z) - Exploiting Temporal Structures of Cyclostationary Signals for
Data-Driven Single-Channel Source Separation [98.95383921866096]
We study the problem of single-channel source separation (SCSS)
We focus on cyclostationary signals, which are particularly suitable in a variety of application domains.
We propose a deep learning approach using a U-Net architecture, which is competitive with the minimum MSE estimator.
arXiv Detail & Related papers (2022-08-22T14:04:56Z) - Pareto Set Learning for Neural Multi-objective Combinatorial
Optimization [6.091096843566857]
Multiobjective optimization (MOCO) problems can be found in many real-world applications.
We develop a learning-based approach to approximate the whole Pareto set for a given MOCO problem without further search procedure.
Our proposed method significantly outperforms some other methods on the multiobjective traveling salesman problem, multiconditioned vehicle routing problem and multi knapsack problem in terms of solution quality, speed, and model efficiency.
arXiv Detail & Related papers (2022-03-29T09:26:22Z) - Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC.
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z) - Automatic selection of basis-adaptive sparse polynomial chaos expansions
for engineering applications [0.0]
We describe three state-of-the-art basis-adaptive approaches for sparse chaos expansions.
We conduct an extensive benchmark in terms of global approximation accuracy on a large set of computational models.
We introduce a novel solver and basis adaptivity selection scheme guided by cross-validation error.
arXiv Detail & Related papers (2020-09-10T12:13:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.