Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences
- URL: http://arxiv.org/abs/2410.18881v1
- Date: Thu, 24 Oct 2024 16:17:18 GMT
- Title: Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences
- Authors: Weijian Luo
- Abstract summary: We introduce Diff-Instruct++ (DI++), the first fast-converging and image-data-free human preference alignment method for one-step text-to-image generators.
In the experiments, we align both UNet-based and DiT-based one-step generators using DI++, with Stable Diffusion 1.5 and PixelArt-$\alpha$ as the reference diffusion processes.
The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset.
- Abstract: One-step text-to-image generator models offer fast inference, flexible architectures, and state-of-the-art generation performance. In this paper, we study, for the first time, the problem of aligning one-step generator models with human preferences. Inspired by the success of reinforcement learning from human feedback (RLHF), we formulate the alignment problem as maximizing expected human reward functions while adding an Integral Kullback-Leibler (IKL) divergence term to prevent the generator from diverging. By overcoming technical challenges, we introduce Diff-Instruct++ (DI++), the first fast-converging and image-data-free human preference alignment method for one-step text-to-image generators. We also present a novel theoretical insight: using CFG for diffusion distillation is secretly performing RLHF with DI++. This finding brings new understanding of CFG and potential contributions to future research involving it. In the experiments, we align both UNet-based and DiT-based one-step generators using DI++, with Stable Diffusion 1.5 and PixelArt-$\alpha$ as the reference diffusion processes. The resulting DiT-based one-step text-to-image model achieves a strong Aesthetic Score of 6.19 and an Image Reward of 1.24 on the COCO validation prompt dataset. It also achieves a leading Human Preference Score (HPSv2.0) of 28.48, outperforming open-sourced models such as Stable Diffusion XL, DMD2, SD-Turbo, and PixelArt-$\alpha$. Both theoretical contributions and empirical evidence indicate that DI++ is a strong human-preference alignment approach for one-step text-to-image models.
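To make the alignment objective concrete, the LaTeX sketch below renders the optimization problem described in the abstract. It is a schematic reconstruction, not the paper's verbatim formulation: the regularization weight $\beta$, the time weighting $w(t)$, and the notation $q_{\theta,t}$, $p_{\mathrm{ref},t}$ for the diffused generator and reference distributions are assumptions adopted from the Diff-Instruct line of work.

```latex
% Schematic DI++ objective, reconstructed from the abstract's description:
% maximize the expected human reward r(x, c) while an Integral
% Kullback-Leibler (IKL) term keeps the one-step generator g_theta close
% to the reference diffusion process. beta and w(t) are assumed weights.
\begin{align}
  \max_{\theta}\;
    \mathbb{E}_{c \sim p(c),\; x \sim g_{\theta}(\cdot \mid c)}
      \big[ r(x, c) \big]
    - \beta\, \mathcal{D}_{\mathrm{IKL}}\big( g_{\theta} \,\Vert\, p_{\mathrm{ref}} \big),
  \quad
  \mathcal{D}_{\mathrm{IKL}}
    = \int_{0}^{T} w(t)\,
      \mathcal{D}_{\mathrm{KL}}\big( q_{\theta,t} \,\Vert\, p_{\mathrm{ref},t} \big)\, \mathrm{d}t .
\end{align}
% For the claimed CFG connection, recall the standard classifier-free
% guidance score with guidance weight w:
\begin{equation}
  \tilde{\epsilon}_{w}(x_t, c)
    = \epsilon(x_t, \varnothing)
      + w \big[ \epsilon(x_t, c) - \epsilon(x_t, \varnothing) \big] .
\end{equation}
% For w > 1, the amplified conditional term behaves like an implicit
% reward gradient, which appears to be the reading the paper formalizes.
```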
Related papers
- Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models [8.352666876052616]
We introduce Diff-Instruct* (DI*), a data-free approach to building one-step text-to-image generative models.
With Stable Diffusion V1.5 as the reference diffusion model, DI* outperforms all previously leading models by a large margin.
arXiv Detail & Related papers (2024-10-28T10:26:19Z)
- Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective [52.778766190479374]
Latent-based image generative models have achieved notable success in image generation tasks.
Despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation.
We propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling.
arXiv Detail & Related papers (2024-10-16T12:13:17Z)
- Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding [84.3224556294803]
Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences.
We aim to optimize downstream reward functions while preserving the naturalness of these design spaces.
Our algorithm integrates soft value functions, which look ahead to how intermediate noisy states lead to high rewards in the future.
arXiv Detail & Related papers (2024-08-15T16:47:59Z)
- PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion [45.06392070934473]
"PanGu-Draw" is a novel latent diffusion model designed for resource-efficient text-to-image synthesis.
We introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models.
Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation.
arXiv Detail & Related papers (2023-12-27T09:21:45Z)
- Reinforcement Learning from Diffusion Feedback: Q* for Image Search [2.5835347022640254]
We present two models for image generation using model-agnostic learning.
RLDF is a singular approach for visual imitation through prior-preserving reward function guidance.
It generates high-quality images over varied domains showcasing class-consistency and strong visual diversity.
arXiv Detail & Related papers (2023-11-27T09:20:12Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficient annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models [97.31200133440308]
We propose using online reinforcement learning to fine-tune text-to-image models.
We focus on diffusion models, defining the fine-tuning task as an RL problem.
Our approach, coined DPOK, integrates policy optimization with KL regularization.
arXiv Detail & Related papers (2023-05-25T17:35:38Z)
- Human Preference Score: Better Aligning Text-to-Image Models with Human Preference [41.270068272447055]
We collect a dataset of human choices on generated images from the Stable Foundation Discord channel.
Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices.
We propose a simple yet effective method to adapt Stable Diffusion to better align with human preferences.
arXiv Detail & Related papers (2023-03-25T10:09:03Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation models on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It benefits a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)