Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
- URL: http://arxiv.org/abs/2502.01051v1
- Date: Mon, 03 Feb 2025 04:51:28 GMT
- Title: Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
- Authors: Tao Zhang, Cheng Da, Kun Ding, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
- Abstract summary: Preference optimization for diffusion models aims to align them with human preferences for images.
Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences.
In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space.
We introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space.
- Score: 46.888425016169144
- License:
- Abstract: Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space, as they can naturally extract features from noisy latent images. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of diffusion models to predict preferences of latent images at various timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space. Experimental results indicate that LPO not only significantly enhances performance in aligning diffusion models with general, aesthetic, and text-image alignment preferences, but also achieves 2.5-28$\times$ training speedup compared to existing preference optimization methods. Our code will be available at https://github.com/casiatao/LPO.
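To make the step-level idea concrete, the following is a minimal PyTorch-style sketch of a noise-aware latent reward model paired with a pairwise step-level preference loss. Everything here is an illustrative assumption (the toy MLP backbone, the names `LatentRewardModel` and `step_level_preference_loss`); the paper repurposes the diffusion model's own components as the backbone, and the authors' implementation is the one to be released at the GitHub link above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentRewardModel(nn.Module):
    """Toy stand-in for the LRM: scores a noisy latent at a given timestep.
    The paper repurposes the diffusion model's own time-conditioned feature
    extractor; a small MLP keeps this sketch self-contained."""

    def __init__(self, latent_dim: int = 4 * 64 * 64, t_dim: int = 128):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, t_dim), nn.SiLU())
        self.head = nn.Sequential(
            nn.Linear(latent_dim + t_dim, 256), nn.SiLU(), nn.Linear(256, 1)
        )

    def forward(self, latent: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # latent: (B, C, H, W) noisy latent; t: (B,) timestep scaled to [0, 1]
        feats = torch.cat([latent.flatten(1), self.t_embed(t[:, None])], dim=1)
        return self.head(feats).squeeze(-1)  # (B,) scalar preference score

def step_level_preference_loss(lrm, z_win, z_lose, t, beta: float = 1.0):
    """Bradley-Terry style loss: at timestep t, the preferred ("winning")
    latent should score higher than the dispreferred one."""
    margin = lrm(z_win, t) - lrm(z_lose, t)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage: a pair of candidate latents at the same denoising step.
lrm = LatentRewardModel()
z_win, z_lose = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
t = torch.rand(2)
step_level_preference_loss(lrm, z_win, z_lose, t).backward()
```

The point the sketch illustrates is that the reward model is conditioned on the timestep and scores noisy latents directly, so no decoding to pixel space or external VLM inference is needed at each denoising step.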
Related papers
- Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking [50.325021634589596]
We propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference.
Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues.
Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
arXiv Detail & Related papers (2025-02-01T16:08:43Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences.
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization [97.35427957922714]
We present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model.
PSO introduces additional reference images sampled from the current time-step distilled model, and increases the relative likelihood margin between the training images and reference images.
We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data.
arXiv Detail & Related papers (2024-10-04T07:05:16Z)
- Aligning Diffusion Models with Noise-Conditioned Perception [42.042822966928576]
Diffusion models are typically optimized in pixel or VAE latent space, which does not align well with human perception.
We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues.
arXiv Detail & Related papers (2024-06-25T15:21:50Z)
- Bridging Model-Based Optimization and Generative Modeling via Conservative Fine-Tuning of Diffusion Models [54.132297393662654]
We introduce a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL.
We demonstrate the capability of our approach to outperform the best designs in offline data, leveraging the extrapolation capabilities of reward models.
arXiv Detail & Related papers (2024-05-30T03:57:29Z)
- Pixel-wise RL on Diffusion Models: Reinforcement Learning from Rich Feedback [0.0]
Latent diffusion models are the state-of-the-art for synthetic image generation.
Aligning these models with human preferences requires training them with reinforcement learning.
Denoising diffusion policy optimisation (DDPO), introduced in prior work, accounts for the iterative denoising nature of the generation.
We present the Pixel-wise Policy Optimisation (PXPO) algorithm, which can take feedback for each pixel, providing a more nuanced reward to the model.
arXiv Detail & Related papers (2024-04-05T18:56:00Z)
- Diffusion Model Alignment Using Direct Preference Optimization [103.2238655827797]
Diffusion-DPO is a method to align diffusion models to human preferences by directly optimizing on human comparison data.
We fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO.
We also develop a variant that uses AI feedback and has comparable performance to training on human preferences.
arXiv Detail & Related papers (2023-11-21T15:24:05Z)
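The Diffusion-DPO entry above optimizes a pairwise objective directly on human comparison data. Below is a simplified, single-timestep sketch of such a DPO-style loss for diffusion models, written against assumed tensor shapes and placeholder noise predictions; the value of beta is illustrative, and the timestep weighting and other details of the actual method are omitted.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_ref_w, eps_theta_l, eps_ref_l,
                       noise_w, noise_l, beta: float = 500.0):
    """Simplified per-timestep DPO-style objective for diffusion models.
    eps_theta_* / eps_ref_* are noise predictions of the policy and a frozen
    reference model on the preferred ("w") and dispreferred ("l") noisy
    latents; noise_* is the noise that was actually added."""
    err = lambda pred, target: (pred - target).pow(2).flatten(1).mean(dim=1)
    diff_w = err(eps_theta_w, noise_w) - err(eps_ref_w, noise_w)
    diff_l = err(eps_theta_l, noise_l) - err(eps_ref_l, noise_l)
    # Beating the reference on the preferred sample (relative to the
    # dispreferred one) lowers the loss, hence the minus sign inside sigmoid.
    return -F.logsigmoid(-beta * (diff_w - diff_l)).mean()

# Toy usage with random tensors standing in for U-Net outputs.
shape = (2, 4, 64, 64)
eps_theta_w = torch.randn(shape, requires_grad=True)
eps_theta_l = torch.randn(shape, requires_grad=True)
eps_ref_w, eps_ref_l, noise_w, noise_l = (torch.randn(shape) for _ in range(4))
loss = diffusion_dpo_loss(eps_theta_w, eps_ref_w, eps_theta_l, eps_ref_l,
                          noise_w, noise_l)
loss.backward()
```

The negative sign inside the sigmoid encodes that a lower denoising error corresponds to a higher implicit reward, which is the core idea these preference-optimization methods share.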
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.