Diffusion Classifier-Driven Reward for Offline Preference-based Reinforcement Learning
- URL: http://arxiv.org/abs/2503.01143v3
- Date: Wed, 24 Sep 2025 11:57:38 GMT
- Title: Diffusion Classifier-Driven Reward for Offline Preference-based Reinforcement Learning
- Authors: Teng Pang, Bingzheng Wang, Guoqiang Wu, Yilong Yin
- Abstract summary: We propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR). DPR directly treats step-wise preference-based reward acquisition as a binary classification and utilizes the robustness of diffusion classifiers to infer step-wise rewards discriminatively. We also propose Conditional Diffusion Preference-based Reward (C-DPR), which conditions on trajectory-wise preference labels to enhance reward inference.
- Score: 45.95668702930697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline preference-based reinforcement learning (PbRL) mitigates the need for reward definition, aligning with human preferences via preference-driven reward feedback without interacting with the environment. However, trajectory-wise preference labels make it difficult to learn precise step-wise rewards, which degrades the performance of downstream algorithms. To alleviate the insufficient step-wise reward caused by trajectory-wise preferences, we propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR). DPR directly treats step-wise preference-based reward acquisition as a binary classification and utilizes the robustness of diffusion classifiers to infer step-wise rewards discriminatively. In addition, to further utilize trajectory-wise preference information, we propose Conditional Diffusion Preference-based Reward (C-DPR), which conditions on trajectory-wise preference labels to enhance reward inference. We apply the above methods to existing offline RL algorithms, and a series of experimental results demonstrate that the diffusion classifier-driven reward outperforms previous reward acquisition methods based on the Bradley-Terry model.
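To make the contrast with the Bradley-Terry pipeline concrete, below is a minimal, hypothetical sketch (notation, network shapes, and function names are ours, not taken from the paper) of the classification view the abstract describes: a label-conditioned diffusion model is queried as a classifier by comparing denoising errors under a "preferred" versus "non-preferred" label, and the resulting class probability serves as a step-wise reward.

```python
# Rough sketch only (not the authors' code): score a single (state, action) step by
# comparing how well a label-conditioned denoiser reconstructs it under each label.
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Toy epsilon-prediction network conditioned on a binary preference label."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        # input = noised (state, action) vector + scalar timestep + scalar label
        self.net = nn.Sequential(
            nn.Linear(dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, y):
        inp = torch.cat([x_t, t[:, None], y[:, None]], dim=-1)
        return self.net(inp)

@torch.no_grad()
def stepwise_reward(denoiser: CondDenoiser, step: torch.Tensor,
                    n_samples: int = 16) -> float:
    """Estimate P(preferred | step) by comparing denoising errors under y=0 vs. y=1."""
    x0 = step.expand(n_samples, -1)                          # repeat the step
    t = torch.rand(n_samples)                                # random noise levels in (0, 1)
    noise = torch.randn_like(x0)
    alpha = (1.0 - t)[:, None]                               # toy linear noise schedule
    x_t = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise   # forward diffusion
    losses = []
    for y_val in (0.0, 1.0):
        y = torch.full((n_samples,), y_val)
        eps_hat = denoiser(x_t, t, y)
        losses.append(((eps_hat - noise) ** 2).mean())
    # Lower denoising error under the "preferred" label (y=1) means higher reward.
    logits = torch.stack([-losses[0], -losses[1]])
    return torch.softmax(logits, dim=0)[1].item()

if __name__ == "__main__":
    dim = 8                                  # toy state-action dimension
    denoiser = CondDenoiser(dim)             # would be trained on preference-labeled steps
    step = torch.randn(1, dim)               # one flattened (state, action) pair
    print("step-wise reward:", stepwise_reward(denoiser, step))
```

An actual implementation would train the conditional denoiser on steps drawn from preference-labeled trajectory pairs and feed the inferred step-wise rewards to a downstream offline RL algorithm, in place of rewards fitted with a Bradley-Terry trajectory likelihood.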
Related papers
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z) - Learnable Chernoff Baselines for Inference-Time Alignment [64.81256817158851]
We introduce Learnable Chernoff Baselines as a method for efficiently and approximately sampling from exponentially tilted kernels. We establish total-variation guarantees to the ideal aligned model, and demonstrate in both continuous and discrete diffusion settings that LCB sampling closely matches ideal rejection sampling.
arXiv Detail & Related papers (2026-02-08T00:09:40Z) - Divergence Minimization Preference Optimization for Diffusion Model Alignment [58.651951388346525]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. Our results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques. DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
arXiv Detail & Related papers (2025-07-10T07:57:30Z) - CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries [13.06534916144093]
We propose Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY). CLARIFY learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart. Our approach not only selects more distinguished queries but also learns meaningful trajectory embeddings.
arXiv Detail & Related papers (2025-05-31T04:37:07Z) - VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL [28.95582264086289]
VAlue-based Reinforced Diffusion (VARD) is a novel approach that first learns a value function predicting the expectation of rewards from intermediate states. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation.
arXiv Detail & Related papers (2025-05-21T17:44:37Z) - Prior-Guided Diffusion Planning for Offline Reinforcement Learning [4.760537994346813]
Prior Guidance (PG) is a novel guided sampling framework that replaces the standard Gaussian prior of a behavior-cloned diffusion model. PG directly generates high-value trajectories without costly reward optimization of the diffusion model itself. We present an efficient training strategy that applies behavior regularization in latent space, and empirically demonstrate that PG outperforms state-of-the-art diffusion policies and planners across diverse long-horizon offline RL benchmarks.
arXiv Detail & Related papers (2025-05-16T05:39:02Z) - Latent Embedding Adaptation for Human Preference Alignment in Diffusion Planners [16.863492060519157]
This work addresses the challenge of personalizing trajectories generated in automated decision-making systems.
We propose a resource-efficient approach that enables rapid adaptation to individual users' preferences.
arXiv Detail & Related papers (2025-03-24T05:11:58Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human-annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking [50.325021634589596]
We propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preferences. Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues. Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
arXiv Detail & Related papers (2025-02-01T16:08:43Z) - In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning [15.369324784520538]
We propose In-Dataset Trajectory Return Regularization (DTR) for offline preference-based reinforcement learning. DTR mitigates the risk of learning inaccurate trajectory stitching under reward bias. We also introduce an ensemble normalization technique that effectively integrates multiple reward models.
arXiv Detail & Related papers (2024-12-12T09:35:47Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Training-free Diffusion Model Alignment with Sampling Demons [15.400553977713914]
We propose an optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling the noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through optimization. Our experiments show that the proposed approach significantly improves the average aesthetics scores in text-to-image generation.
arXiv Detail & Related papers (2024-10-08T07:33:49Z) - Hindsight Preference Learning for Offline Preference-based Reinforcement Learning [22.870967604847458]
Offline preference-based reinforcement learning (RL) focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset.
We propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments.
Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets.
arXiv Detail & Related papers (2024-07-05T12:05:37Z) - Preference Alignment with Flow Matching [23.042382086241364]
Preference Flow Matching (PFM) is a new framework for preference-based reinforcement learning (PbRL).
It streamlines the integration of preferences into an arbitrary class of pre-trained models.
We provide theoretical insights that support our method's alignment with standard PbRL objectives.
arXiv Detail & Related papers (2024-05-30T08:16:22Z) - Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models [54.132297393662654]
We introduce a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL.
We demonstrate the capability of our approach to outperform the best designs in offline data, leveraging the extrapolation capabilities of reward models.
arXiv Detail & Related papers (2024-05-30T03:57:29Z) - Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data. We analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations.
arXiv Detail & Related papers (2024-05-29T17:39:48Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function. We propose a new policy update method from this theory, which we denote Q-score matching.
arXiv Detail & Related papers (2023-12-18T23:31:01Z)