MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning
- URL: http://arxiv.org/abs/2408.15501v2
- Date: Tue, 27 May 2025 03:51:30 GMT
- Title: MODULI: Unlocking Preference Generalization via Diffusion Models for Offline Multi-Objective Reinforcement Learning
- Authors: Yifu Yuan, Zhenrui Zheng, Zibin Dong, Jianye Hao
- Abstract summary: Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. Existing offline MORL algorithms exhibit poor generalization to OOD preferences, resulting in policies that do not align with preferences. We propose MODULI, which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences.
- Score: 36.24516512865215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-objective Reinforcement Learning (MORL) seeks to develop policies that simultaneously optimize multiple conflicting objectives, but it requires extensive online interactions. Offline MORL provides a promising solution by training on pre-collected datasets to generalize to any preference upon deployment. However, real-world offline datasets are often conservatively and narrowly distributed, failing to comprehensively cover preferences and leading to the emergence of out-of-distribution (OOD) preference areas. Existing offline MORL algorithms exhibit poor generalization to OOD preferences, resulting in policies that do not align with preferences. Leveraging the excellent expressive and generalization capabilities of diffusion models, we propose MODULI (Multi-objective Diffusion Planner with Sliding Guidance), which employs a preference-conditioned diffusion model as a planner to generate trajectories that align with various preferences and derives actions for decision-making. To achieve accurate generation, MODULI introduces two return normalization methods under diverse preferences for refining guidance. To further enhance generalization to OOD preferences, MODULI proposes a novel sliding guidance mechanism, which involves training an additional slider adapter to capture the direction of preference changes. Incorporating the slider, it transitions from in-distribution (ID) preferences to generating OOD preferences, patching and extending the incomplete Pareto front. Extensive experiments on the D4MORL benchmark demonstrate that our algorithm outperforms state-of-the-art offline MORL baselines, exhibiting excellent generalization to OOD preferences.
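The core mechanics the abstract describes, a denoiser conditioned on a preference vector plus a small adapter that shifts that preference along a learned direction at sampling time, can be sketched in a few dozen lines. The sketch below is a hedged illustration, not the paper's code: the module names (PrefDenoiser, SliderAdapter, plan), the MLP backbone, the crude timestep feature, and the additive preference shift are all assumptions made for this example.

```python
# Hedged sketch of a preference-conditioned diffusion planner with a
# sliding-guidance adapter, loosely following the MODULI abstract.
# NOT the authors' implementation; shapes and the slider rule are assumed.
import torch
import torch.nn as nn

class PrefDenoiser(nn.Module):
    """epsilon_theta(x_t, t, w): noise prediction for a flattened
    trajectory x_t, conditioned on diffusion step t and preference w."""
    def __init__(self, traj_dim: int, pref_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + 1 + pref_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, traj_dim),
        )

    def forward(self, x_t, t, w):
        t_feat = t.float().unsqueeze(-1) / 1000.0  # crude scalar timestep feature
        return self.net(torch.cat([x_t, t_feat, w], dim=-1))

class SliderAdapter(nn.Module):
    """Maps a preference w to a direction of preference change; scaling
    its output slides generation from ID preferences toward OOD ones."""
    def __init__(self, pref_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pref_dim, hidden), nn.SiLU(), nn.Linear(hidden, pref_dim),
        )

    def forward(self, w):
        return self.net(w)

@torch.no_grad()
def plan(denoiser, slider, w, alpha, traj_dim, steps=100):
    """DDPM ancestral sampling with the conditioning preference slid
    by alpha along the adapter direction."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    w_slid = w + alpha * slider(w)  # sliding guidance: shift the condition
    x = torch.randn(1, traj_dim)
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.full((1,), t), w_slid)
        # Standard DDPM posterior-mean update.
        x = (x - betas[t] / torch.sqrt(1.0 - abar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # decode and execute the first action for control
```

In this reading, alpha = 0 reproduces plain ID-conditioned planning, and ramping alpha slides the condition toward OOD preferences to patch and extend the Pareto front.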
Related papers
- Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy [0.786519149320184]
Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. We introduce Local direct Preference Optimization (LPO) and sparse LPO, plug-and-play solutions for alignment with spatially varying human feedback or sparse feedback. Our model can effectively perform both weakly-supervised and unsupervised DAS, depending on the availability of points and human preferences.
arXiv Detail & Related papers (2026-02-23T01:39:03Z)
- RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems [109.9061591263748]
RecNet is a self-evolving preference propagation framework for recommender systems. It proactively propagates real-time preference updates across related users and items. In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework.
arXiv Detail & Related papers (2026-01-29T12:14:31Z)
- From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models [0.7366405857677227]
This survey provides a theoretical unification of preference learning methods. We formalize each axis with precise definitions and theorems. We synthesize empirical findings across 50+ papers and provide a practitioner's decision guide for method selection.
arXiv Detail & Related papers (2026-01-03T08:33:26Z)
- Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation [60.33386541343322]
We propose a Multimodal Large Language Model framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation (HaNoRec). Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness.
arXiv Detail & Related papers (2025-11-24T04:10:46Z)
- Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment [54.17386822940477]
We introduce PromptLoop, a plug-and-play reinforcement learning framework that incorporates latent feedback into step-wise prompt refinement. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment.
arXiv Detail & Related papers (2025-10-01T02:18:58Z)
- Divergence Minimization Preference Optimization for Diffusion Model Alignment [58.651951388346525]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. Our results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques. DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
- Preference-Guided Diffusion for Multi-Objective Offline Optimization [64.08326521234228]
We propose a preference-guided diffusion model for offline multi-objective optimization.
Our guidance is a preference model trained to predict the probability that one design dominates another.
Our results highlight the effectiveness of classifier-guided diffusion models in generating diverse and high-quality solutions.
arXiv Detail & Related papers (2025-03-21T16:49:38Z)
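The guidance signal this entry describes, a model of the probability that one design dominates another, can be approximated with a pairwise classifier whose log-probability gradient steers sampling. Below is a hedged sketch: the architecture, the dominance labeling rule, and the guidance_grad helper are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch of a pairwise dominance model for classifier-guided
# diffusion over designs; details are assumed, not from the paper.
import torch
import torch.nn as nn

class DominancePredictor(nn.Module):
    def __init__(self, design_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * design_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a, b):
        # Logit of P(a dominates b); sigmoid gives the probability.
        return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)

def dominates(ya, yb):
    # Pareto dominance on objective vectors (maximization): a dominates b
    # iff a >= b on every objective and a > b on at least one.
    return bool((ya >= yb).all() and (ya > yb).any())

def guidance_grad(model, x, x_ref, scale=1.0):
    """Gradient of log P(x dominates x_ref) w.r.t. x; added to the
    diffusion score at each step to steer samples toward designs
    predicted to dominate a reference."""
    x = x.detach().requires_grad_(True)
    logp = nn.functional.logsigmoid(model(x, x_ref)).sum()
    (grad,) = torch.autograd.grad(logp, x)
    return scale * grad
```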
- DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning [30.654668373387214]
We propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR).
DPR uses diffusion models to directly model preference distributions for state-action pairs, allowing rewards to be discriminatively obtained from these distributions.
We apply the above methods to existing offline reinforcement learning algorithms, and a series of experimental results demonstrates that the diffusion-based reward acquisition approach outperforms prior approaches, including Transformer-based methods.
arXiv Detail & Related papers (2025-03-03T03:49:38Z)
- Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment [74.25832963097658]
Multi-Objective Alignment (MOA) aims to align responses with multiple human preference objectives.
We find that DPO-based MOA approaches suffer from widespread preference conflicts in the data.
arXiv Detail & Related papers (2025-02-20T08:27:00Z)
- Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications.
Ensuring their alignment with the diverse preferences of individual users has become a critical challenge.
We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
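For context on the DPO objective this entry builds on, the standard loss can be written directly from sequence log-probabilities under the policy and a frozen reference model. The sketch below is the textbook form of DPO, not this paper's distillation variant.

```python
# Standard DPO loss: the policy is trained directly on preference pairs,
# with an implicit reward beta * log(pi / pi_ref).
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_* are summed token log-probs of the chosen (w) and rejected (l)
    responses under the policy; ref_logp_* under the frozen reference."""
    chosen = beta * (logp_w - ref_logp_w)      # implicit reward, chosen
    rejected = beta * (logp_l - ref_logp_l)    # implicit reward, rejected
    # Logistic loss on the reward margin: maximize P(chosen > rejected).
    return -F.logsigmoid(chosen - rejected).mean()
```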
- Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives [0.5120567378386615]
We propose a hybrid approach to aligning large language models (LLMs).
With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards.
The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives.
arXiv Detail & Related papers (2024-05-28T08:35:48Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization [76.09576643028362]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives.
MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models.
It theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z)
- Scaling Pareto-Efficient Decision Making Via Offline Multi-Objective RL [22.468486569700236]
The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives.
We propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent.
PEDA is a family of offline MORL algorithms that builds and extends Decision Transformers via a novel preference-and-return-conditioned policy.
arXiv Detail & Related papers (2023-04-30T20:15:26Z)
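The preference-and-return-conditioned policy PEDA describes can be illustrated by widening the policy input with the preference vector and a vector-valued return-to-go. The sketch below uses a plain MLP where PEDA extends Decision Transformers, so the module and input layout are assumptions for illustration.

```python
# Hedged sketch of a preference-and-return-conditioned policy input,
# in the spirit of PEDA; the MLP stands in for a Decision Transformer.
import torch
import torch.nn as nn

class PrefReturnPolicy(nn.Module):
    def __init__(self, state_dim: int, act_dim: int, n_obj: int, hidden: int = 128):
        super().__init__()
        # Condition on (state, preference w, multi-objective return-to-go g).
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2 * n_obj, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, state, pref, rtg):
        return self.net(torch.cat([state, pref, rtg], dim=-1))

def rtg_targets(rewards):
    # Vector return-to-go: suffix sums of the per-step multi-objective
    # rewards, shape (T, n_obj), used as conditioning during training.
    return torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
```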
- PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm [0.18416014644193063]
We propose a novel MORL algorithm that trains a single universal network to cover the entire preference space, scalable to continuous robotic tasks.
PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
arXiv Detail & Related papers (2022-08-16T19:23:02Z)
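The "single universal network" idea can be sketched as one Q-network that consumes the preference and outputs per-objective values, which the same preference then scalarizes for action selection. The architecture and the linear scalarization below are assumptions, not PD-MORL's exact objective.

```python
# Hedged sketch of a universal preference-conditioned Q-network;
# details are assumed, not PD-MORL's training rule.
import torch
import torch.nn as nn

class UniversalQ(nn.Module):
    def __init__(self, state_dim: int, act_dim: int, n_obj: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim + n_obj, hidden), nn.ReLU(),
            nn.Linear(hidden, n_obj),  # one value estimate per objective
        )

    def forward(self, s, a, w):
        return self.net(torch.cat([s, a, w], dim=-1))

def scalarized_value(q_net, s, a, w):
    # Linear scalarization w^T Q(s, a, w): changing w re-targets the same
    # network to a different point on the Pareto front.
    return (q_net(s, a, w) * w).sum(-1)
```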
- gTLO: A Generalized and Non-linear Multi-Objective Deep Reinforcement Learning Approach [2.0305676256390934]
Generalized Thresholded Lexicographic Ordering (gTLO) is a novel method that aims to combine non-linear MORL with the advantages of generalized MORL.
We present promising results on a standard benchmark for non-linear MORL and a real-world application from the domain of manufacturing process control.
arXiv Detail & Related papers (2022-04-11T10:06:49Z)
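For readers unfamiliar with the base scheme gTLO generalizes, thresholded lexicographic ordering compares value vectors by clipping each higher-priority objective at a threshold, so lower-priority objectives act as tiebreakers once thresholds are met. A minimal sketch follows; the threshold values and Q-vectors are made up for illustration.

```python
# Minimal sketch of thresholded lexicographic ordering (TLO), the
# scalarization family that gTLO generalizes.
import numpy as np

def tlo_key(values, thresholds):
    """values: per-objective scores in priority order; thresholds: one per
    objective except the last. Returns a lexicographically comparable tuple."""
    clipped = [min(v, t) for v, t in zip(values[:-1], thresholds)]
    return tuple(clipped) + (values[-1],)

# Example: both actions satisfy objective 0's threshold of 0.6 after
# clipping, so objective 1 breaks the tie and "b" is selected.
q = {"a": np.array([0.9, 0.2]), "b": np.array([0.7, 0.8])}
best = max(q, key=lambda k: tlo_key(q[k], thresholds=[0.6]))  # -> "b"
```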