Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
- URL: http://arxiv.org/abs/2502.01051v2
- Date: Thu, 20 Mar 2025 05:36:28 GMT
- Title: Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
- Authors: Tao Zhang, Cheng Da, Kun Ding, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
- Abstract summary: Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space. We introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space.
- Score: 46.888425016169144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space, as they can naturally extract features from noisy latent images. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of diffusion models to predict preferences of latent images at various timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space. Experimental results indicate that LPO not only significantly enhances performance in aligning diffusion models with general, aesthetic, and text-image alignment preferences, but also achieves 2.5-28$\times$ training speedup compared to existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
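To make the abstract's idea more concrete, here is a minimal sketch, in plain PyTorch, of how a noise-aware latent reward model and a step-level preference loss could fit together. All class, function, and argument names (LatentRewardModel, lrm_training_loss, step_level_preference_loss, unet_encoder, etc.) are hypothetical illustrations of the described approach, not the interface of the released code at https://github.com/Kwai-Kolors/LPO.

```python
import torch
import torch.nn.functional as F


class LatentRewardModel(torch.nn.Module):
    """Hypothetical sketch of a noise-aware Latent Reward Model (LRM).

    A diffusion U-Net encoder is reused as a feature extractor for noisy
    latents x_t, conditioned on the timestep and the prompt embedding, and a
    small head maps the pooled features to a scalar preference score.
    """

    def __init__(self, unet_encoder: torch.nn.Module, feature_dim: int):
        super().__init__()
        self.unet_encoder = unet_encoder  # repurposed diffusion-model backbone
        self.head = torch.nn.Linear(feature_dim, 1)  # lightweight preference head

    def forward(self, noisy_latent, timestep, text_emb):
        # Assumed encoder output: a feature map of shape (B, feature_dim, H, W).
        feats = self.unet_encoder(noisy_latent, timestep, text_emb)
        pooled = feats.mean(dim=(-2, -1))  # global average pooling
        return self.head(pooled).squeeze(-1)  # one scalar reward per latent


def lrm_training_loss(lrm, latent_preferred, latent_rejected, timestep, text_emb):
    """Bradley-Terry-style loss on a preferred/rejected latent pair at timestep t."""
    r_w = lrm(latent_preferred, timestep, text_emb)
    r_l = lrm(latent_rejected, timestep, text_emb)
    return -F.logsigmoid(r_w - r_l).mean()


def step_level_preference_loss(policy_logp_w, policy_logp_l,
                               ref_logp_w, ref_logp_l, beta=0.1):
    """One plausible step-level objective (DPO-style): once the LRM has labeled
    two candidate denoised latents at the current step as winner/loser, push the
    policy's log-likelihood toward the winner relative to a frozen reference
    model. The paper's exact LPO objective may differ."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```

The point mirrored from the abstract is that the reward model consumes noisy latents and the timestep directly, so no decoding to pixel space or VLM forward pass is needed at intermediate denoising steps.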
Related papers
- InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment [12.823734370183482]
We introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models.
Our approach conceptualizes the diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively.
Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning.
arXiv Detail & Related papers (2025-03-24T08:58:49Z)
- DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models [50.32663816994459]
Diffusion-styled Preference Optimization (DiffPO) provides an efficient and policy-agnostic solution for aligning LLMs with humans.
DiffPO avoids the time latency associated with token-level generation.
Experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings.
arXiv Detail & Related papers (2025-03-06T09:21:54Z)
- Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking [50.325021634589596]
We propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference.
Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues.
Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
arXiv Detail & Related papers (2025-02-01T16:08:43Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences.
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization [97.35427957922714]
We present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model.
PSO introduces additional reference images sampled from the current time-step distilled model, and increases the relative likelihood margin between the training images and reference images.
We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data.
arXiv Detail & Related papers (2024-10-04T07:05:16Z)
- Aligning Diffusion Models with Noise-Conditioned Perception [42.042822966928576]
Preference optimization for diffusion models is typically performed in pixel or VAE space, which does not align well with human perception.
We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues.
arXiv Detail & Related papers (2024-06-25T15:21:50Z)
- Pixel-wise RL on Diffusion Models: Reinforcement Learning from Rich Feedback [0.0]
Latent diffusion models are the state-of-the-art for synthetic image generation.
To align these models with human preferences, training the models using reinforcement learning is crucial.
Building on denoising diffusion policy optimisation (DDPO), which accounts for the iterative denoising nature of the generation, we present the Pixel-wise Policy Optimisation (PXPO) algorithm, which can take feedback for each pixel, providing a more nuanced reward to the model.
arXiv Detail & Related papers (2024-04-05T18:56:00Z)
- Diffusion Model Alignment Using Direct Preference Optimization [103.2238655827797]
Diffusion-DPO is a method to align diffusion models to human preferences by directly optimizing on human comparison data; a generic sketch of this style of objective is given after this list.
We fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO.
We also develop a variant that uses AI feedback and has comparable performance to training on human preferences.
arXiv Detail & Related papers (2023-11-21T15:24:05Z)
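Several of the related papers above, in particular Diffusion-DPO and PSO, share the same pairwise ingredient: a log-sigmoid margin between the policy's fit to the preferred sample and its fit to the rejected one, each measured relative to a frozen reference model. As a hedged illustration (the exact weighting and notation differ between papers), a Diffusion-DPO-style objective in the standard noise-prediction parameterization can be written as:

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\left[\log \sigma\!\left(-\beta\,\omega(t)\left(
\left\|\epsilon^w - \epsilon_\theta(x_t^w, t)\right\|^2 - \left\|\epsilon^w - \epsilon_{\mathrm{ref}}(x_t^w, t)\right\|^2
- \left\|\epsilon^l - \epsilon_\theta(x_t^l, t)\right\|^2 + \left\|\epsilon^l - \epsilon_{\mathrm{ref}}(x_t^l, t)\right\|^2
\right)\right)\right]
$$

where $(x_0^w, x_0^l)$ is a preferred/rejected image pair, $x_t^w$ and $x_t^l$ are their noised versions at timestep $t$ with noise $\epsilon^w, \epsilon^l$, $\epsilon_\theta$ is the model being fine-tuned, $\epsilon_{\mathrm{ref}}$ is the frozen reference model, and $\omega(t)$ is a timestep-dependent weight. As described in its summary above, PSO applies the same relative-likelihood-margin idea but draws the rejected (reference) samples from the current timestep-distilled model itself rather than from a labeled comparison dataset.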
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.