PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with
Time-Decoupled Training and Reusable Coop-Diffusion
- URL: http://arxiv.org/abs/2312.16486v2
- Date: Fri, 29 Dec 2023 01:57:39 GMT
- Title: PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with
Time-Decoupled Training and Reusable Coop-Diffusion
- Authors: Guansong Lu, Yuanfan Guo, Jianhua Han, Minzhe Niu, Yihan Zeng, Songcen
Xu, Zeyi Huang, Zhao Zhong, Wei Zhang, Hang Xu
- Abstract summary: "PanGu-Draw" is a novel latent diffusion model designed for resource-efficient text-to-image synthesis.
We introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models.
Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation.
- Score: 45.06392070934473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current large-scale diffusion models represent a giant leap forward in
conditional image synthesis, capable of interpreting diverse cues like text,
human poses, and edges. However, their reliance on substantial computational
resources and extensive data collection remains a bottleneck. On the other
hand, the integration of existing diffusion models, each specialized for
different controls and operating in unique latent spaces, poses a challenge due
to incompatible image resolutions and latent space embedding structures,
hindering their joint use. Addressing these constraints, we present
"PanGu-Draw", a novel latent diffusion model designed for resource-efficient
text-to-image synthesis that adeptly accommodates multiple control signals. We
first propose a resource-efficient Time-Decoupling Training Strategy, which
splits the monolithic text-to-image model into structure and texture
generators. Each generator is trained using a regimen that maximizes data
utilization and computational efficiency, cutting data preparation by 48% and
reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an
algorithm that enables the cooperative use of various pre-trained diffusion
models with different latent spaces and predefined resolutions within a unified
denoising process. This allows for multi-control image synthesis at arbitrary
resolutions without the necessity for additional data or retraining. Empirical
validations of PanGu-Draw show its exceptional prowess in text-to-image and
multi-control image generation, suggesting a promising direction for future
model training efficiencies and generation versatility. The largest 5B T2I
PanGu-Draw model is released on the Ascend platform. Project page:
https://pangu-draw.github.io
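
The abstract does not spell out the sampling procedure, but the core idea of the Time-Decoupling strategy, one denoising trajectory split between a structure generator for the early, high-noise timesteps and a texture generator for the late, low-noise timesteps, can be sketched as below. This is a minimal illustrative sketch under assumed details: the dummy networks, the Euler-style update, and the `switch_ratio` hand-off point are not PanGu-Draw's actual models, scheduler, or split.

```python
# Minimal sketch of time-decoupled sampling: a structure generator handles the
# early, high-noise steps and a texture generator the late, low-noise steps.
# Everything below is an illustrative stand-in, not PanGu-Draw's implementation.
import torch

def make_dummy_eps_model():
    # Stand-in for a pretrained noise-prediction network (hypothetical).
    return lambda x, t, cond: torch.zeros_like(x)

structure_model = make_dummy_eps_model()  # assumed to cover the early denoising steps
texture_model = make_dummy_eps_model()    # assumed to cover the late denoising steps

def time_decoupled_sample(cond, steps=50, switch_ratio=0.5, shape=(1, 4, 64, 64)):
    """Run one denoising trajectory, handing off between the two generators.

    `switch_ratio` (an assumed parameter) decides where the structure generator
    stops and the texture generator takes over. The update rule is a plain
    Euler-style step for illustration; a real sampler follows the noise schedule.
    """
    x = torch.randn(shape)
    timesteps = torch.linspace(1.0, 0.0, steps + 1)[:-1]
    switch_at = int(steps * switch_ratio)
    for i, t in enumerate(timesteps):
        model = structure_model if i < switch_at else texture_model
        eps = model(x, t, cond)
        x = x - eps / steps  # placeholder update
    return x

latent = time_decoupled_sample(cond="a photo of a red bicycle")
```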
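Likewise, the sketch below illustrates how two pretrained models with different autoencoder latent spaces and resolutions, the setting Coop-Diffusion targets, could contribute to a single denoising step by bridging predictions through pixel space. The stand-in autoencoders, the fixed alpha_bar value, and the simple averaging of the two noise estimates are assumptions for illustration, not the paper's exact bridging or fusion rule.

```python
# Hedged sketch of latent-space bridging for a fused denoising step.
# The autoencoders and the fusion rule below are illustrative stand-ins.
import torch
import torch.nn.functional as F

# Hypothetical autoencoders: plain resizing used only for shape bookkeeping.
decode_a = lambda z: F.interpolate(z, scale_factor=8, mode="bilinear")      # latent A -> pixels
encode_a = lambda x: F.interpolate(x, scale_factor=1 / 8, mode="bilinear")  # pixels -> latent A
decode_b = lambda z: F.interpolate(z, scale_factor=4, mode="bilinear")      # latent B -> pixels
encode_b = lambda x: F.interpolate(x, scale_factor=1 / 4, mode="bilinear")  # pixels -> latent B

eps_model_a = lambda z, t, c: torch.zeros_like(z)  # e.g. a text-conditioned model (stand-in)
eps_model_b = lambda z, t, c: torch.zeros_like(z)  # e.g. an edge-conditioned model (stand-in)

def coop_step(z_a, t, cond_a, cond_b, alpha_bar_t=0.5):
    """Fuse two models' noise predictions in model A's latent space.

    Model B participates by taking the estimated clean latent, crossing into
    its own latent space through pixel space, predicting noise there, and
    mapping the prediction back. Averaging the two estimates is an assumption,
    not necessarily the paper's rule.
    """
    eps_a = eps_model_a(z_a, t, cond_a)
    z0_a = (z_a - (1 - alpha_bar_t) ** 0.5 * eps_a) / alpha_bar_t ** 0.5  # predicted clean latent
    z_b = encode_b(decode_a(z0_a))                           # bridge A's latent space to B's
    eps_b = encode_a(decode_b(eps_model_b(z_b, t, cond_b)))  # B's prediction, mapped back to A
    return 0.5 * (eps_a + eps_b)                             # fused noise estimate

fused_eps = coop_step(torch.randn(1, 4, 64, 64), t=0.5, cond_a="a castle", cond_b=None)
```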
Related papers
- MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models [34.611309081801345]
Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation.
In this paper, we propose a novel strategy to scale a generative model across new tasks with minimal compute.
arXiv Detail & Related papers (2024-04-15T17:55:56Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different, relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation [62.45086888512723]
We propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient.
We show AlignProp achieves higher rewards in fewer training steps than alternatives, while being conceptually simpler.
arXiv Detail & Related papers (2023-10-05T17:59:18Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Flow Matching in Latent Space [2.9330609943398525]
Flow matching is a framework for training generative models that exhibits impressive empirical performance.
We propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency.
Our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks.
arXiv Detail & Related papers (2023-07-17T17:57:56Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation models on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- Image Generation with Multimodal Priors using Denoising Diffusion Probabilistic Models [54.1843419649895]
A major challenge in using generative models to accomplish this task is the lack of paired data containing all modalities and corresponding outputs.
We propose a solution based on denoising diffusion probabilistic models to generate images under multimodal priors.
arXiv Detail & Related papers (2022-06-10T12:23:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.