Related papers: Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment

URL: http://arxiv.org/abs/2402.10207v5
Date: Wed, 5 Jun 2024 09:25:41 GMT
Title: Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
Authors: Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen,
Abstract summary: We introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context. RiC only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time.
Score: 46.44464839353993
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless AI systems. However, it is generally costly and unstable to fine-tune large foundation models using reinforcement learning (RL), and the multi-dimensionality, heterogeneity, and conflicting nature of human preferences further complicate the alignment process. In this paper, we introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context and applies supervised fine-tuning for alignment. The salient features of RiC are simplicity and adaptivity, as it only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time. Inspired by the analytical solution of an abstracted convex optimization problem, our dynamic inference-time adjustment method approaches the Pareto-optimal solution for multiple objectives. Empirical evidence demonstrates the efficacy of our method in aligning both Large Language Models (LLMs) and diffusion models to accommodate diverse rewards with only around 10% GPU hours compared with multi-objective RL baseline.

Related papers

Reinforcement Learning for Multi-Objective Multi-Echelon Supply Chain Optimisation [3.1194372040101928]
The model is evaluated using a multi-objective reinforcement learning (RL) method, benchmarked against an originally single-objective RL algorithm modified with weighted sum.<n>We conduct experiments on varying network complexities, mimicking typical real-world challenges using a customisable simulator.<n>The model determines production and delivery quantities across supply chain routes to achieve near-optimal trade-offs between competing objectives.
arXiv Detail & Related papers (2025-07-26T04:30:11Z)
Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning [0.0]
We introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions.<n>Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors.
arXiv Detail & Related papers (2025-05-17T06:09:13Z)
REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective [16.79332387603131]
Multi-objective preference alignment in language models often encounters a challenging trade-off. We explore a novel data-driven approach to uncover the types of data that can effectively mitigate these conflicts. Our generated data achieves an average improvement of 13.37% in both the harmless rate and helpfulness win rate when optimizing harmlessness and helpfulness.
arXiv Detail & Related papers (2025-04-15T16:09:19Z)
Preference-Guided Diffusion for Multi-Objective Offline Optimization [64.08326521234228]
We propose a preference-guided diffusion model for offline multi-objective optimization. Our guidance is a preference model trained to predict the probability that one design dominates another. Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions.
arXiv Detail & Related papers (2025-03-21T16:49:38Z)
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality [15.53963063493065]
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models with human values. Existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. We introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations.
arXiv Detail & Related papers (2025-03-10T09:52:42Z)
Robust Multi-Objective Preference Alignment with Online DPO [6.434799451791957]
Multi-objective preference alignment is critical for developing AI systems that are personalizable, helpful, and safe. Existing approaches are either computationally expensive to train or do not sufficiently steer model behaviors. This paper introduces the Multi-Objective Online DPO algorithm, designed to robustly and efficiently align model behaviors with multiple, potentially conflicting human preferences.
arXiv Detail & Related papers (2025-03-01T02:01:49Z)
MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment [14.541973333460149]
Mixing Preference Optimization (MPO) is a post-processing framework for aggregating single-objective policies. MPO achieves balanced performance across diverse preferences, outperforming existing models with significantly reduced computational costs.
arXiv Detail & Related papers (2025-02-25T23:22:12Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Adaptive Learning of Design Strategies over Non-Hierarchical Multi-Fidelity Models via Policy Alignment [0.0]
Multi-fidelity Reinforcement Learning frameworks enhance the efficiency of engineering design by leveraging analysis models with varying levels of accuracy and computational costs. This work proposes ALPHA, a novel multi-fidelity RL framework to efficiently learn a high-fidelity policy by adaptively leveraging an arbitrary set of non-hierarchical, heterogeneous, low-fidelity models alongside a high-fidelity model. The effectiveness of ALPHA is demonstrated in analytical test optimization and octocopter design problems, utilizing two low-fidelity models alongside a high-fidelity one.
arXiv Detail & Related papers (2024-11-16T16:54:33Z)
TSO: Self-Training with Scaled Preference Optimization [14.3799656174528]
We propose TSO, a framework for preference optimization that conducts self-training preference learning without training an additional reward model. TSO enhances the diversity of responses by constructing a model matrix and incorporating human preference responses. Experimental results demonstrate that TSO outperforms existing mainstream methods on various alignment evaluation benchmarks.
arXiv Detail & Related papers (2024-08-31T05:37:01Z)
Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
Context-aware Diversity Enhancement for Neural Multi-Objective Combinatorial Optimization [19.631213689157995]
Multi-objective optimization (MOCO) problems are prevalent in various real-world applications. We propose a Context-aware Diversity Enhancement algorithm named CDE. The proposed CDE can effectively and efficiently grasp the context information, resulting in diversity enhancement.
arXiv Detail & Related papers (2024-05-14T13:42:19Z)
UCB-driven Utility Function Search for Multi-objective Reinforcement Learning [75.11267478778295]
In Multi-objective Reinforcement Learning (MORL) agents are tasked with optimising decision-making behaviours. We focus on the case of linear utility functions parameterised by weight vectors w. We introduce a method based on Upper Confidence Bound to efficiently search for the most promising weight vectors during different stages of the learning process.
arXiv Detail & Related papers (2024-05-01T09:34:42Z)
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values. Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
Dependency Structure Search Bayesian Optimization for Decision Making Models [29.95525433889418]
We propose a compact multi-layered architecture modeling the dynamics of agent interactions through the concept of role. We show strong empirical results under malformed or sparse reward.
arXiv Detail & Related papers (2023-08-01T15:56:24Z)
Exploiting Temporal Structures of Cyclostationary Signals for Data-Driven Single-Channel Source Separation [98.95383921866096]
We study the problem of single-channel source separation (SCSS) We focus on cyclostationary signals, which are particularly suitable in a variety of application domains. We propose a deep learning approach using a U-Net architecture, which is competitive with the minimum MSE estimator.
arXiv Detail & Related papers (2022-08-22T14:04:56Z)
Pareto Set Learning for Neural Multi-objective Combinatorial Optimization [6.091096843566857]
Multiobjective optimization (MOCO) problems can be found in many real-world applications. We develop a learning-based approach to approximate the whole Pareto set for a given MOCO problem without further search procedure. Our proposed method significantly outperforms some other methods on the multiobjective traveling salesman problem, multiconditioned vehicle routing problem and multi knapsack problem in terms of solution quality, speed, and model efficiency.
arXiv Detail & Related papers (2022-03-29T09:26:22Z)
Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems. Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC. We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.