PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
- URL: http://arxiv.org/abs/2402.08714v2
- Date: Wed, 27 Mar 2024 21:37:39 GMT
- Title: PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
- Authors: Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou,
- Abstract summary: Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives.
Existing reward finetuning methods are limited by their instability in large-scale prompt datasets.
We propose Proximal Reward Difference Prediction (PRDP) to enable stable black-box reward finetuning for diffusion models.
- Score: 13.313186665410486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However, in the vision domain, existing RL-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts whereas RL-based methods completely fail.
Related papers
- Data-regularized Reinforcement Learning for Diffusion Models at Scale [99.01056178660538]
We introduce Data-regularized Diffusion Reinforcement Learning ( DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution.<n>With over a million GPU hours of experiments and ten thousand double-blind evaluations, we demonstrate that DDRL significantly improves rewards while alleviating the reward hacking seen in RLs.
arXiv Detail & Related papers (2025-12-03T23:45:07Z) - DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning [37.20873499361773]
We propose a unified framework for training masked diffusion large language models (dLLMs) to reason better (furious)<n>We first unify the existing baseline approach by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy.<n>We also propose a new direction of joint training efficient samplers/controllers of dLLMs policy. Via RL, we incentivize dLLMs' natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt.
arXiv Detail & Related papers (2025-10-02T16:57:24Z) - SPARK: Synergistic Policy And Reward Co-Evolving Framework [84.22494672256894]
We introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR.<n>Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model.<n>We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks.
arXiv Detail & Related papers (2025-09-26T17:50:12Z) - Reinforcement Learning on Pre-Training Data [55.570379963147424]
We introduce Reinforcement Learning on Pre-Training data (R), a new training-time scaling paradigm for optimizing large language models (LLMs)<n>R enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL)<n>Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of R.
arXiv Detail & Related papers (2025-09-23T17:10:40Z) - Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs.<n>We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason.<n>We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
arXiv Detail & Related papers (2025-07-08T23:22:34Z) - Fake it till You Make it: Reward Modeling as Discriminative Prediction [49.31309674007382]
GAN-RM is an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering.<n>Our method trains the reward model through discrimination between a small set of representative, unpaired target samples.<n>Experiments demonstrate our GAN-RM's effectiveness across multiple key applications.
arXiv Detail & Related papers (2025-06-16T17:59:40Z) - Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO [51.22869332661607]
We decompose the performance gap between reinforcement learning from human feedback and direct preference optimization under a representation gap.<n>We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications.
arXiv Detail & Related papers (2025-05-26T09:54:02Z) - VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL [28.95582264086289]
VAlue-based Reinforced Diffusion (VARD) is a novel approach that first learns a value function predicting expection of rewards from intermediate states.<n>Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation.
arXiv Detail & Related papers (2025-05-21T17:44:37Z) - DPR: Diffusion Preference-based Reward for Offline Reinforcement Learning [30.654668373387214]
We propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR)
DPR uses diffusion models to directly model preference distributions for state-action pairs, allowing rewards to be discriminatively obtained from these distributions.
We apply the above methods to existing offline reinforcement learning algorithms and a series of experiment results demonstrate that the diffusion-based reward acquisition approach outperforms previous-based and Transformer-based methods.
arXiv Detail & Related papers (2025-03-03T03:49:38Z) - Distributionally Robust Reinforcement Learning with Human Feedback [13.509499718691016]
We introduce a distributionally robust RLHF for fine-tuning large language models.
Our goal is to ensure that a fine-tuned model retains its performance even when the distribution of prompts significantly differs.
We show that our robust training improves the accuracy of the learned reward models on average, and markedly on some tasks, such as reasoning.
arXiv Detail & Related papers (2025-03-01T15:43:39Z) - Best Policy Learning from Trajectory Preference Feedback [11.896067099790962]
Preference-based Reinforcement Learning (PbRL) offers a more robust alternative.<n>We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models.<n>We propose Posterior Sampling for Preference Learning ($mathsfPSPL$), a novel algorithm inspired by Top-Two Thompson Sampling.
arXiv Detail & Related papers (2025-01-31T03:55:10Z) - Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward [7.574124278327481]
We propose fine-tuning diffusion models with learned differentiable surrogate rewards.
Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones.
LaSRO is effective and stable for improving ultra-fast image generation with different reward objectives.
arXiv Detail & Related papers (2024-11-22T08:00:20Z) - RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution [50.171320156632866]
Reinforcement learning from human feedback offers a promising approach to aligning large language models with human preferences.<n>Current reward models operate as sequence-to-one models, allocating a single, sparse, and delayed reward to an entire output sequence.<n>We propose a more fine-grained, token-level guidance approach for RL training.
arXiv Detail & Related papers (2024-11-13T02:45:21Z) - How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback)
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z) - Zeroth-Order Policy Gradient for Reinforcement Learning from Human
Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference.
The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator.
Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z) - Reward-Directed Score-Based Diffusion Models via q-Learning [8.725446812770791]
We propose a new reinforcement learning (RL) formulation for training continuous-time score-based diffusion models for generative AI.
Our formulation does not involve any pretrained model for the unknown score functions of the noise-perturbed data distributions.
arXiv Detail & Related papers (2024-09-07T13:55:45Z) - On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization [25.76847680704863]
Two main approaches for learning a reward model are 1) training an EXplicit Reward Model (EXRM) as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO)
This work studies the accuracy at distinguishing preferred and rejected answers for both DPORM and EXRM.
arXiv Detail & Related papers (2024-09-05T16:08:19Z) - Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment.
We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO)
XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z) - Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble [67.4269821365504]
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values.
However, RLHF relies on a reward model that is trained with a limited amount of human preference data.
We contribute a reward ensemble method that allows the reward model to make more accurate predictions.
arXiv Detail & Related papers (2024-01-30T00:17:37Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z) - Simplifying Model-based RL: Learning Representations, Latent-space
Models, and Policies with One Objective [142.36200080384145]
We propose a single objective which jointly optimize a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.