Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking
- URL: http://arxiv.org/abs/2502.01667v1
- Date: Sat, 01 Feb 2025 16:08:43 GMT
- Title: Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking
- Authors: Jie Ren, Yuhang Zhang, Dongrui Liu, Xiaopeng Zhang, Qi Tian,
- Abstract summary: We propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference.
Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues.
Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
- Score: 50.325021634589596
- License:
- Abstract: Direct preference optimization (DPO) has shown success in aligning diffusion models with human preference. Previous approaches typically assume a consistent preference label between final generations and noisy samples at intermediate steps, and directly apply DPO to these noisy samples for fine-tuning. However, we theoretically identify inherent issues in this assumption and its impacts on the effectiveness of preference alignment. We first demonstrate the inherent issues from two perspectives: gradient direction and preference order, and then propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference, underpinned by some theoretical insights. Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues through a simple yet efficient design. Additionally, we incorporate the gradient guidance of diffusion models into preference alignment to further enhance the optimization effectiveness. Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
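As a rough illustration of the step-wise ranking idea in the abstract, the sketch below shows a DPO-style objective in which the chosen/rejected pair at a denoising step is decided by a step-wise reward on the two intermediate samples rather than by the preference label of the final images. This is a minimal PyTorch reconstruction from the abstract, not the authors' implementation; the log-probability and reward inputs are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): a DPO-style objective where the
# chosen/rejected pair at denoising step t is ranked by a step-wise reward
# on the two intermediate samples, as the abstract describes.
# All inputs are tensors of shape [batch].
import torch
import torch.nn.functional as F

def stepwise_dpo_loss(policy_logp_a, policy_logp_b,   # log p_theta(x_{t-1}^{a,b} | x_t)
                      ref_logp_a, ref_logp_b,         # same quantities under the frozen reference model
                      reward_a, reward_b,             # step-wise rewards of the two intermediate samples
                      beta: float = 0.1) -> torch.Tensor:
    # Rank the pair by step-wise reward: the higher-reward sample is "chosen".
    a_wins = reward_a >= reward_b
    chosen_logp   = torch.where(a_wins, policy_logp_a, policy_logp_b)
    rejected_logp = torch.where(a_wins, policy_logp_b, policy_logp_a)
    chosen_ref    = torch.where(a_wins, ref_logp_a, ref_logp_b)
    rejected_ref  = torch.where(a_wins, ref_logp_b, ref_logp_a)

    # Standard Bradley-Terry / DPO objective on the implicit rewards.
    logits = beta * ((chosen_logp - chosen_ref) - (rejected_logp - rejected_ref))
    return -F.logsigmoid(logits).mean()
```

In practice the per-step log-probabilities would come from the Gaussian denoising transitions of the fine-tuned policy and the frozen reference model.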
Related papers
- Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models.
CaPO incorporates the general preference from multiple reward models without human annotated data.
Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z)
- Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization [46.888425016169144]
Preference optimization for diffusion models aims to align them with human preferences for images.
Previous methods typically leverage Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences.
In this work, we demonstrate that diffusion models are inherently well-suited for step-level reward modeling in the latent space.
We introduce Latent Preference Optimization (LPO), a method designed for step-level preference optimization directly in the latent space.
arXiv Detail & Related papers (2025-02-03T04:51:28Z)
- Personalized Preference Fine-tuning of Diffusion Models [75.22218338096316]
We introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences.
With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way.
Our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
arXiv Detail & Related papers (2025-01-11T22:38:41Z)
- Aligning Few-Step Diffusion Models with Dense Reward Difference Learning [81.85515625591884]
Stepwise Diffusion Policy Optimization (SDPO) is an alignment method tailored for few-step diffusion models.
SDPO incorporates dense reward feedback at every intermediate step to ensure consistent alignment across all denoising steps.
SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations.
arXiv Detail & Related papers (2024-11-18T16:57:41Z)
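The SDPO summary above hinges on dense, step-level reward feedback. The sketch below shows one generic way to obtain such a signal by scoring the predicted clean sample at every denoising step; it is an assumed illustration, not SDPO's actual algorithm, and `denoise_step`, `predict_x0`, and `reward_model` are hypothetical interfaces.

```python
# Generic sketch (not SDPO's actual algorithm): collect dense reward feedback
# at every denoising step by scoring the model's predicted clean sample x0_hat.
# `denoise_step`, `predict_x0`, and `reward_model` are hypothetical interfaces.
import torch

@torch.no_grad()
def dense_step_rewards(x_T, timesteps, denoise_step, predict_x0, reward_model):
    """Run the sampler once and return a reward for every intermediate step."""
    x_t = x_T
    rewards = []
    for t in timesteps:                      # e.g. [T, T-1, ..., 1]
        x0_hat = predict_x0(x_t, t)          # one-step estimate of the clean sample
        rewards.append(reward_model(x0_hat)) # dense, step-level feedback
        x_t = denoise_step(x_t, t)           # move to the next noise level
    return x_t, torch.stack(rewards)
```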
- Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization [14.50339880957898]
We aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques.
For preference data generation, we propose an iterative pairwise ranking mechanism that derives a preference ranking of completions from pairwise comparison signals.
For training regularization, we observe that preference optimization tends to converge better when the LLM-predicted likelihood of preferred samples is slightly reduced.
arXiv Detail & Related papers (2024-11-07T23:03:11Z)
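A simple way to realize the iterative pairwise ranking described above is binary insertion driven by a pairwise judge; the sketch below is an assumed illustration rather than the paper's exact mechanism, with `prefer(a, b)` standing in for whatever comparison signal is used.

```python
# Illustrative sketch (not the paper's exact mechanism): build a preference
# ranking of completions from pairwise comparisons only. `prefer(a, b)` is a
# hypothetical judge that returns True when `a` is preferred over `b`.
from typing import Callable, List

def rank_by_pairwise(completions: List[str],
                     prefer: Callable[[str, str], bool]) -> List[str]:
    """Insert each completion into a ranked list (best first) using binary
    insertion, so only O(n log n) pairwise comparisons are needed."""
    ranked: List[str] = []
    for c in completions:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefer(c, ranked[mid]):   # c beats the item at position `mid`
                hi = mid
            else:
                lo = mid + 1
        ranked.insert(lo, c)
    return ranked
```

Binary insertion keeps the number of comparisons near O(n log n), but it implicitly assumes the pairwise judge is mostly transitive.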
- Training-free Diffusion Model Alignment with Sampling Demons [15.400553977713914]
We propose an optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining.
Our approach controls the noise distribution at each denoising step, using optimization to concentrate density on high-reward regions.
To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models.
arXiv Detail & Related papers (2024-10-08T07:33:49Z)
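One simple instantiation of the inference-time, backpropagation-free guidance sketched in the Demon summary is best-of-N noise selection: draw several candidate transitions at each step, score their predicted clean images with the reward, and keep the best. The code below is a simplified illustration under that assumption (the actual method optimizes the noise distribution rather than taking an argmax), and `ddpm_step`, `predict_x0`, and `reward_model` are hypothetical interfaces.

```python
# Simplified sketch in the spirit of the Demon summary above: best-of-N noise
# selection per denoising step, with no gradients through the reward model.
# `ddpm_step`, `predict_x0`, and `reward_model` are hypothetical interfaces.
import torch

@torch.no_grad()
def reward_guided_step(x_t, t, ddpm_step, predict_x0, reward_model, n_candidates=8):
    """Among n candidate stochastic transitions x_t -> x_{t-1}, keep the one
    whose predicted clean image scores highest under the reward model."""
    best_x, best_r = None, -float("inf")
    for _ in range(n_candidates):
        noise = torch.randn_like(x_t)            # candidate injected noise
        x_prev = ddpm_step(x_t, t, noise)        # stochastic denoising transition
        r = reward_model(predict_x0(x_prev, t - 1)).item()  # scalar reward (single image)
        if r > best_r:
            best_x, best_r = x_prev, r
    return best_x
```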
- Gradient Guidance for Diffusion Models: An Optimization Perspective [45.6080199096424]
This paper studies a form of gradient guidance for adapting a pre-trained diffusion model towards optimizing user-specified objectives.
We establish a mathematical framework for guided diffusion to systematically study its optimization theory and algorithmic design.
arXiv Detail & Related papers (2024-04-23T04:51:02Z)
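For context, the basic form of gradient guidance referred to above can be sketched, classifier-guidance style, as nudging each denoising update along the gradient of the user objective evaluated on the predicted clean sample. The interfaces and the fixed guidance weight below are illustrative assumptions, not the paper's algorithm.

```python
# Illustrative sketch of gradient guidance (classifier-guidance style), not the
# paper's exact algorithm: shift each denoising update along the gradient of a
# user-specified objective evaluated on the predicted clean sample x0_hat.
# `denoise_mean`, `predict_x0`, and `objective` are hypothetical interfaces.
import torch

def guided_step(x_t, t, sigma_t, denoise_mean, predict_x0, objective, guidance_scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = predict_x0(x_t, t)                  # differentiable estimate of the clean sample
    grad = torch.autograd.grad(objective(x0_hat).sum(), x_t)[0]

    with torch.no_grad():
        mean = denoise_mean(x_t, t)              # unguided posterior mean for x_{t-1}
        # Shift the mean toward higher objective values, then add the usual noise.
        return mean + guidance_scale * grad + sigma_t * torch.randn_like(mean)
```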
- Diffusion Model Alignment Using Direct Preference Optimization [103.2238655827797]
Diffusion-DPO is a method to align diffusion models to human preferences by directly optimizing on human comparison data.
We fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO.
We also develop a variant that uses AI feedback and has comparable performance to training on human preferences.
arXiv Detail & Related papers (2023-11-21T15:24:05Z)