Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
- URL: http://arxiv.org/abs/2402.10210v1
- Date: Thu, 15 Feb 2024 18:59:18 GMT
- Title: Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation
- Authors: Huizhuo Yuan and Zixiang Chen and Kaixuan Ji and Quanquan Gu
- Abstract summary: Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI)
In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion)
Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
- Score: 59.184980778643464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning Diffusion Models remains an underexplored frontier in generative
artificial intelligence (GenAI), especially when compared with the remarkable
progress made in fine-tuning Large Language Models (LLMs). While cutting-edge
diffusion models such as Stable Diffusion (SD) and SDXL rely on supervised
fine-tuning, their performance inevitably plateaus after seeing a certain
volume of data. Recently, reinforcement learning (RL) has been employed to
fine-tune diffusion models with human preference data, but it requires at least
two images ("winner" and "loser" images) for each text prompt. In this paper,
we introduce an innovative technique called self-play fine-tuning for diffusion
models (SPIN-Diffusion), where the diffusion model engages in competition with
its earlier versions, facilitating an iterative self-improvement process. Our
approach offers an alternative to conventional supervised fine-tuning and RL
strategies, significantly improving both model performance and alignment. Our
experiments on the Pick-a-Pic dataset reveal that SPIN-Diffusion outperforms
the existing supervised fine-tuning method in aspects of human preference
alignment and visual appeal right from its first iteration. By the second
iteration, it exceeds the performance of RLHF-based methods across all metrics,
achieving these results with less data.
Related papers
- Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step.
Our framework offers a 1.3$times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z) - Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization [97.35427957922714]
We present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model.
PSO introduces additional reference images sampled from the current time-step distilled model, and increases the relative likelihood margin between the training images and reference images.
We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data.
arXiv Detail & Related papers (2024-10-04T07:05:16Z) - Pruning then Reweighting: Towards Data-Efficient Training of Diffusion Models [33.09663675904689]
We investigate efficient diffusion training from the perspective of dataset pruning.
Inspired by the principles of data-efficient training for generative models such as generative adversarial networks (GANs), we first extend the data selection scheme used in GANs to DM training.
To further improve the generation performance, we employ a class-wise reweighting approach.
arXiv Detail & Related papers (2024-09-27T20:21:19Z) - Text-to-Image Rectified Flow as Plug-and-Play Priors [52.586838532560755]
Rectified flow is a novel class of generative models that enforces a linear progression from the source to the target distribution.
We show that rectified flow approaches surpass in terms of generation quality and efficiency, requiring fewer inference steps.
Our method also displays competitive performance in image inversion and editing.
arXiv Detail & Related papers (2024-06-05T14:02:31Z) - Plug-and-Play Diffusion Distillation [14.359953671470242]
We propose a new distillation approach for guided diffusion models.
An external lightweight guide model is trained while the original text-to-image model remains frozen.
We show that our method reduces the inference of classifier-free guided latent-space diffusion models by almost half.
arXiv Detail & Related papers (2024-06-04T04:22:47Z) - DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z) - Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.