Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization
- URL: http://arxiv.org/abs/2602.07764v1
- Date: Sun, 08 Feb 2026 01:45:01 GMT
- Title: Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization
- Authors: Tanmay Ambadkar, Sourav Panda, Shreyash Kale, Jonathan Dodge, Abhinav Verma
- Abstract summary: Multi-objective reinforcement learning seeks to learn policies that balance multiple, often conflicting objectives. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to directly address the structural failures of existing methods. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization.
- Score: 2.595968385299781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces sensitivity of policy behavior to preference changes, preventing collapse. Across standard MORL benchmarks, including high-dimensional and many-objective control tasks, $D^3PO$ consistently discovers broader and higher-quality Pareto fronts than prior single- and multi-policy methods, matching or exceeding state-of-the-art hypervolume and expected utility while using a single deployable policy.
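The abstract names three mechanisms: decomposed per-objective optimization, scalarization only after per-objective stabilization, and a scaled diversity regularizer. The sketch below shows one way such a loss could be wired together. It is a minimal reading of the abstract, not the paper's implementation; every name and the exact form of each term (the normalization, the diversity proxy, the coefficients) are assumptions.

```python
# Hypothetical sketch of a decomposed, preference-conditioned PPO loss.
# Nothing here is taken from the D^3PO paper's code; names and the exact
# form of each term are illustrative assumptions only.
import torch

def d3po_style_loss(ratio, advantages, prefs, logp_a, logp_b,
                    clip_eps=0.2, div_coef=0.1):
    """
    ratio:      (B,) importance ratios pi_theta / pi_old
    advantages: (B, K) per-objective advantages, one column per objective
    prefs:      (B, K) preference weights on the simplex (the conditioning input)
    logp_a/b:   (B,) log-probs of the same actions under two different
                preference conditionings, used by the diversity term
    """
    # 1) Decomposed step: normalize each objective's advantage separately,
    #    so no objective's scale drowns out the others' gradients.
    adv_norm = (advantages - advantages.mean(0)) / (advantages.std(0) + 1e-8)

    # 2) Scalarize only after per-objective stabilization.
    scalar_adv = (prefs * adv_norm).sum(dim=-1)

    # Standard PPO clipped surrogate on the stabilized scalar advantage.
    surr = torch.min(ratio * scalar_adv,
                     ratio.clamp(1 - clip_eps, 1 + clip_eps) * scalar_adv)

    # 3) Diversity regularizer: reward the policy for behaving differently
    #    under different preference vectors (a crude divergence proxy).
    diversity = (logp_a - logp_b).abs().mean()

    return -(surr.mean() + div_coef * diversity)
```

The key ordering (normalize each objective's advantage first, then apply preference weights) is what the abstract credits with avoiding destructive gradient interference from premature scalarization.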
Related papers
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
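A minimal sketch of the "scalarization as a contextual bandit" idea in this summary, assuming an epsilon-greedy linear bandit over a fixed set of candidate weight vectors; the paper's Conductor network and bi-level training are not reproduced here, and all names are illustrative.

```python
# Hypothetical contextual-bandit view of choosing scalarization weights.
# Arm set, context features, and update rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 16                                # candidate arms, context dim
arms = rng.dirichlet(np.ones(3), size=K)    # each arm = one weight vector
theta = np.zeros((K, D))                    # one linear value model per arm

def choose_arm(context, eps=0.1):
    """Pick an arm index given a context vector (e.g. a pooled hidden
    state); use arms[choose_arm(ctx)] as the scalarization weights."""
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(theta @ context))

def update(arm, context, meta_reward, lr=0.01):
    # Regress the chosen arm's value toward the observed meta-reward
    # (the summary suggests group-relative advantage plays this role).
    pred = theta[arm] @ context
    theta[arm] += lr * (meta_reward - pred) * context
```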
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
Group Relative Policy Optimization (GRPO) is often applied in multi-reward settings without examining its suitability; we show why this is problematic. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
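A hedged sketch of the decoupling idea the name suggests: instead of scalarizing rewards and then computing one group-normalized advantage, normalize each reward stream within the group before combining. This is my reading of the summary, not GDPO's exact formulation.

```python
# Contrast between naive group normalization of a summed reward and
# per-reward decoupled normalization. The GDPO paper's exact rule is
# not reproduced here; this is an illustrative assumption.
import numpy as np

def naive_group_advantage(rewards):            # rewards: (group, K)
    total = rewards.sum(axis=1)                # scalarize first ...
    return (total - total.mean()) / (total.std() + 1e-8)   # ... then normalize

def decoupled_group_advantage(rewards):        # rewards: (group, K)
    # Normalize each reward stream within the group independently, then
    # combine, so differently scaled rewards contribute comparably.
    z = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return z.sum(axis=1)
```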
arXiv Detail & Related papers (2026-01-08T18:59:24Z)
- ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning [17.928214942495412]
ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase. We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro. Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
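One hypothetical way to realize such a curriculum is a schedule that anneals how off-policy each update is allowed to be. The knobs below (batch reuse count, clip range) and the linear schedule are guesses for illustration, not ACPO's actual mechanism.

```python
# Illustrative on-policy -> off-policy curriculum schedule (an assumption,
# not ACPO's published schedule).
def curriculum(step, total_steps, max_reuse=8, eps_lo=0.1, eps_hi=0.3):
    t = min(step / total_steps, 1.0)
    # Early: fresh data only (reuse = 1) and conservative clipping.
    # Late: reuse each batch more times and clip more loosely.
    reuse = 1 + int(t * (max_reuse - 1))
    clip_eps = eps_lo + t * (eps_hi - eps_lo)
    return reuse, clip_eps
```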
arXiv Detail & Related papers (2025-10-01T09:11:27Z)
- Principled Multimodal Representation Learning [99.53621521696051]
Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain. We propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities.
arXiv Detail & Related papers (2025-07-23T09:12:25Z)
- Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models [15.799929216215672]
We introduce the Multi-Objective Preference Optimization (MOPO) algorithm, which frames alignment as a constrained KL-regularized optimization. Unlike prior work, MOPO operates directly on pairwise preference data, requires no point-wise reward assumption, and avoids prompt-context engineering.
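Since MOPO works directly on pairwise preferences with KL regularization, a DPO-style implicit-reward loss is a natural sketch of the ingredients. How MOPO actually combines per-objective losses and enforces its constraints is not given in this summary, so the fixed-weight combination below is an assumption.

```python
# Sketch of a KL-regularized pairwise-preference loss across objectives,
# loosely following the MOPO summary; the combination rule is assumed.
import torch
import torch.nn.functional as F

def pairwise_pref_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO-style implicit-reward margin; beta controls the strength of
    # the KL regularization toward the reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def multi_objective_loss(per_obj_pairs, weights):
    # per_obj_pairs: list of (logp_w, logp_l, ref_logp_w, ref_logp_l)
    # tensors, one tuple per objective's preference dataset.
    return sum(w * pairwise_pref_loss(*p)
               for w, p in zip(weights, per_obj_pairs))
```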
arXiv Detail & Related papers (2025-05-16T05:58:26Z)
- Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization [11.739526562075339]
The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding regularization to the transition dynamics in the value function. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm, which employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on transition kernels.
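Read literally, "adding regularization to the transition dynamics in the value function" suggests a robust Bellman backup of roughly the following form (notation mine, a sketch rather than the paper's exact operator):

$V(s) = \max_a \big\{ r(s,a) + \gamma \inf_{P} \big( \mathbb{E}_{s' \sim P}[V(s')] + \lambda\, D_f\big(P(\cdot \mid s,a) \,\|\, P^{0}(\cdot \mid s,a)\big) \big) \big\}$

where $P^{0}$ is the nominal transition kernel, $D_f$ is an $f$-divergence, and $\lambda > 0$ trades robustness against fidelity to $P^{0}$. The "Pessimistic" in R2PVI presumably refers to an additional pessimism adjustment suited to the offline setting.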
arXiv Detail & Related papers (2024-11-27T18:57:03Z)
- C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front [9.04360155372014]
Constrained MORL is a seamless bridge between constrained policy optimization and MORL. Our algorithm achieves more consistent and superior performance in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks.
arXiv Detail & Related papers (2024-10-03T06:13:56Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- UCB-driven Utility Function Search for Multi-objective Reinforcement Learning [51.00436121587591]
In Multi-objective Reinforcement Learning (MORL), agents are tasked with optimising decision-making behaviours that trade off between multiple, possibly conflicting, objectives. We focus on the case of linear utility functions parametrised by weight vectors w. We introduce a method based on the Upper Confidence Bound to efficiently search for the most promising weight vectors during different stages of the learning process.
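This maps cleanly onto a standard UCB1 loop over a discrete set of candidate weight vectors. The candidate set, the utility signal, and the exploration constant below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal UCB1 sketch over candidate linear-utility weight vectors w.
import numpy as np

rng = np.random.default_rng(0)
W = rng.dirichlet(np.ones(3), size=10)   # candidate weight vectors on the simplex
counts = np.zeros(len(W))
means = np.zeros(len(W))

def select(t, c=2.0):
    # Try every arm once, then apply the usual UCB1 rule.
    if (counts == 0).any():
        return int(np.argmax(counts == 0))
    return int(np.argmax(means + np.sqrt(c * np.log(t) / counts)))

def observe(i, utility):
    # utility: e.g. the scalarized return achieved when training with W[i].
    counts[i] += 1
    means[i] += (utility - means[i]) / counts[i]
```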
arXiv Detail & Related papers (2024-05-01T09:34:42Z)
- Conflict-Averse Gradient Aggregation for Constrained Multi-Objective Reinforcement Learning [13.245000585002858]
In many real-world applications, a reinforcement learning (RL) agent should consider multiple objectives and adhere to safety guidelines.
We propose a constrained multi-objective gradient aggregation algorithm named Constrained Multi-Objective Gradient Aggregator (CoGAMO).
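The summary does not give CoGAMO's aggregation rule. As a stand-in, the sketch below shows the classic conflict-averse projection (in the style of PCGrad): when two objective gradients point in opposing directions, the conflicting component is projected away before averaging.

```python
# PCGrad-style conflict-averse aggregation, shown as a generic stand-in
# for whatever rule CoGAMO actually uses.
import torch

def aggregate(grads):
    """grads: list of flattened per-objective gradient tensors."""
    out = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, h in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g, h)
            if dot < 0:  # conflict: remove g's component along h
                g -= dot / (h.norm() ** 2 + 1e-12) * h
        out.append(g)
    return torch.stack(out).mean(dim=0)
```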
arXiv Detail & Related papers (2024-03-01T04:57:13Z)
- Policy-regularized Offline Multi-objective Reinforcement Learning [11.58560880898882]
We extend the offline policy-regularized method, a widely-adopted approach for single-objective offline RL problems, into the multi-objective setting.
We propose two solutions to this problem: 1) filtering out preference-inconsistent demonstrations via approximating behavior preferences, and 2) adopting regularization techniques with high policy expressiveness.
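A minimal sketch of solution (1), assuming each trajectory's per-objective returns are available: estimate the behavior preference a trajectory implies, then keep only trajectories close to the target preference. The estimator and the L1 threshold are illustrative assumptions.

```python
# Hypothetical preference-consistency filter for offline MORL data.
import numpy as np

def behavior_preference(returns):             # returns: (K,) per-objective return
    r = np.clip(returns, 0, None)
    return r / (r.sum() + 1e-8)               # crude simplex-normalized proxy

def filter_dataset(trajs, target_pref, tol=0.3):
    keep = []
    for traj in trajs:                        # traj.returns assumed: (K,)
        w_hat = behavior_preference(traj.returns)
        if np.abs(w_hat - target_pref).sum() <= tol:   # L1 consistency test
            keep.append(traj)
    return keep
```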
arXiv Detail & Related papers (2024-01-04T12:54:10Z)
- CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This provides the first finite-time analysis of SRL algorithms with convergence to globally optimal policies.
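As I understand CRPO's primal approach, each iteration takes a gradient step to reduce a violated constraint if one exists, and otherwise a step to increase the reward; the sketch below captures that alternation with placeholder estimators.

```python
# Sketch of CRPO-style alternating primal updates; tolerances and
# gradient estimators are placeholders.
def crpo_step(theta, reward_grad, constraints, lr=1e-3, tol=0.05):
    """theta: parameter vector; constraints: list of (value, grad, limit)."""
    for c_val, c_grad, limit in constraints:
        if c_val > limit + tol:
            return theta - lr * c_grad, "constraint"  # reduce violated cost
    return theta + lr * reward_grad, "reward"         # otherwise ascend reward
```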
arXiv Detail & Related papers (2020-11-11T16:05:14Z)