Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning
- URL: http://arxiv.org/abs/2505.11864v2
- Date: Sat, 07 Jun 2025 04:39:58 GMT
- Title: Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning
- Authors: Kalyan Cherukuri, Aarav Lala
- Abstract summary: We introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As generative agents become increasingly capable, aligning their behavior with complex human values remains a fundamental challenge. Existing approaches often simplify human intent by reducing it to a scalar reward, overlooking the multi-faceted nature of human feedback. In this work, we introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. We formalize the problem of recovering a Pareto-optimal reward representation from noisy preference queries and establish conditions for identifying the underlying multi-objective structure. We derive tight sample complexity bounds for recovering $\epsilon$-approximations of the Pareto front and introduce a regret formulation to quantify suboptimality in this multi-objective setting. Furthermore, we propose a provably convergent algorithm for policy optimization using preference-inferred reward cones. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors in high-dimensional, value-pluralistic environments.
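As a rough illustration of the setup described in the abstract, the sketch below models noisy pairwise preferences over trajectories with vector-valued returns via a Bradley-Terry-style likelihood on a linear scalarization, and filters candidates into a toy epsilon-approximate Pareto front. All names (`preference_prob`, `eps_dominates`, `epsilon_pareto_front`) and the specific likelihood are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def preference_prob(r_a, r_b, w, beta=5.0):
    """Bradley-Terry-style probability that trajectory A is preferred to B,
    given vector-valued returns r_a, r_b and a latent trade-off vector w.
    (Illustrative assumption: a noisy linear scalarization drives preferences.)"""
    return 1.0 / (1.0 + np.exp(-beta * w @ (r_a - r_b)))

def sample_preference(r_a, r_b, w):
    """Simulate one noisy pairwise preference query."""
    return bool(rng.random() < preference_prob(r_a, r_b, w))

def eps_dominates(a, b, eps):
    """a epsilon-dominates b: a is within eps of b (or better) on every
    objective and strictly better on at least one."""
    return np.all(a + eps >= b) and np.any(a > b)

def epsilon_pareto_front(returns, eps=0.05):
    """Greedy filter: keep a candidate unless an already-kept point
    epsilon-dominates it (a simple epsilon-Pareto archive)."""
    order = np.argsort(-returns.sum(axis=1))  # heuristic: visit strong points first
    front = []
    for i in order:
        if not any(eps_dominates(o, returns[i], eps) for o in front):
            front.append(returns[i])
    return np.array(front)

# Toy example: 2-objective returns for a handful of candidate policies.
returns = rng.uniform(0.0, 1.0, size=(8, 2))
w_true = np.array([0.7, 0.3])           # latent, unobserved trade-off
queries = [(0, 1), (2, 3), (4, 5)]       # trajectory pairs shown to the annotator
answers = [sample_preference(returns[i], returns[j], w_true) for i, j in queries]
print("noisy preferences:", answers)
print("epsilon-Pareto front:\n", epsilon_pareto_front(returns))
```

In the paper's setting, the latent trade-off (or, more generally, a preference-inferred reward cone) would be estimated from many such queries rather than fixed; here `w_true` exists only to generate synthetic answers.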
Related papers
- COPO: Consistency-Aware Policy Optimization [17.328515578426227]
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. We propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency.
arXiv Detail & Related papers (2025-08-06T07:05:18Z) - Recursive Reward Aggregation [51.552609126905885]
We propose an alternative approach for flexible behavior alignment that eliminates the need to modify the reward function. By introducing an algebraic perspective on Markov decision processes (MDPs), we show that the Bellman equations naturally emerge from the generation and aggregation of rewards. Our approach applies to both deterministic and stochastic settings and seamlessly integrates with value-based and actor-critic algorithms.
arXiv Detail & Related papers (2025-07-11T12:37:20Z) - Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time [52.230936493691985]
We propose SITAlign, an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds for our satisficing-based inference-time alignment approach.
arXiv Detail & Related papers (2025-05-29T17:56:05Z) - Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks [81.44256822500257]
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences. However, RLHF exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. We propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities.
arXiv Detail & Related papers (2025-05-19T08:33:11Z) - UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality [52.49062565901046]
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models with human values. Existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. We introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations.
arXiv Detail & Related papers (2025-03-10T09:52:42Z) - Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning [16.10753846850319]
Continual learning allows models to persistently acquire and retain information. However, catastrophic forgetting can severely impair model performance. We introduce a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations.
arXiv Detail & Related papers (2025-01-21T13:33:45Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning [20.491176017183044]
This paper tackles the multi-objective reinforcement learning (MORL) problem.
It introduces an innovative actor-critic algorithm named MOAC, which finds a policy by iteratively making trade-offs among conflicting reward signals.
arXiv Detail & Related papers (2024-05-05T23:52:57Z) - Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment [46.44464839353993]
We introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context.
RiC only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment of user preferences at inference time.
arXiv Detail & Related papers (2024-02-15T18:58:31Z) - Divide and Conquer: Provably Unveiling the Pareto Front with Multi-Objective Reinforcement Learning [5.897578963773195]
We introduce Iterated Pareto Referent optimisation (IPRO). IPRO decomposes finding the Pareto front into a sequence of constrained single-objective problems. By leveraging problem-specific single-objective solvers, our approach holds promise for applications beyond multi-objective reinforcement learning. (A toy sketch of this decomposition idea appears after this list.)
arXiv Detail & Related papers (2024-02-11T12:35:13Z)
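The IPRO entry above describes decomposing Pareto-front discovery into a sequence of constrained single-objective problems. Below is a rough, hedged sketch of that general idea only (an epsilon-constraint-style sweep over toy candidate return vectors), not the authors' algorithm; `solve_constrained`, `sweep_pareto_points`, and the random candidate set are illustrative placeholders.

```python
import numpy as np

def solve_constrained(candidates, objective_idx, thresholds):
    """Placeholder single-objective solver: among candidate return vectors,
    maximize one objective subject to lower bounds on the remaining ones.
    Returns None if no candidate satisfies the constraints."""
    feasible = [r for r in candidates
                if all(r[k] >= t for k, t in thresholds.items())]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r[objective_idx])

def sweep_pareto_points(candidates, primary=0, secondary=1, steps=5):
    """Trace approximate Pareto points by sweeping a constraint on the
    secondary objective and re-solving the constrained problem each time."""
    lo = min(r[secondary] for r in candidates)
    hi = max(r[secondary] for r in candidates)
    points = []
    for t in np.linspace(lo, hi, steps):
        best = solve_constrained(candidates, primary, {secondary: t})
        if best is not None and not any(np.allclose(best, p) for p in points):
            points.append(best)
    return points

# Toy candidate set of 2-objective returns.
rng = np.random.default_rng(1)
candidates = [rng.uniform(0, 1, size=2) for _ in range(10)]
for p in sweep_pareto_points(candidates):
    print(np.round(p, 3))
```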