Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image
Diffusion Models
- URL: http://arxiv.org/abs/2306.00637v2
- Date: Fri, 29 Sep 2023 05:32:46 GMT
- Title: Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image
Diffusion Models
- Authors: Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal and
Marc Aubreville
- Abstract summary: Würstchen is a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness.
A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation.
- Score: 6.821399706256863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Würstchen, a novel architecture for text-to-image synthesis
that combines competitive performance with unprecedented cost-effectiveness for
large-scale text-to-image diffusion models. A key contribution of our work is
to develop a latent diffusion technique in which we learn a detailed but
extremely compact semantic image representation used to guide the diffusion
process. This highly compressed representation of an image provides much more
detailed guidance compared to latent representations of language and this
significantly reduces the computational requirements to achieve
state-of-the-art results. Our approach also improves the quality of
text-conditioned image generation based on our user preference study. The
training requirements of our approach consist of 24,602 A100-GPU hours,
compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also
requires less training data to achieve these results. Furthermore, our compact
latent representations allow us to perform inference more than twice as fast,
slashing the usual costs and carbon footprint of a state-of-the-art (SOTA)
diffusion model significantly, without compromising the end performance. In a
broader comparison against SOTA models our approach is substantially more
efficient and compares favorably in terms of image quality. We believe that
this work motivates more emphasis on the prioritization of both performance and
computational accessibility.
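To make the described pipeline concrete, here is a minimal usage sketch. It assumes the model is available through the Hugging Face diffusers library under the warp-ai/wuerstchen checkpoint; the checkpoint id and call arguments are assumptions and may differ between library versions. The text-conditional stage (Stage C in the paper) first samples the extremely compact semantic latent, which the remaining stages then decode back to pixels.

```python
# Minimal sketch, assuming the Hugging Face `diffusers` integration of
# Wuerstchen ("warp-ai/wuerstchen"); checkpoint id and argument names are
# assumptions and may differ between library versions.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

# The text-conditional Stage C samples the extremely compact semantic latent
# described in the abstract; the later stages then expand it back to pixels.
image = pipe(
    "an astronaut riding a horse, photorealistic",
    height=1024,
    width=1024,
).images[0]
image.save("wuerstchen_sample.png")
```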
Related papers
- YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z)
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions [5.100085108873068]
We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU.
Our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.
arXiv Detail & Related papers (2024-03-25T11:16:23Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation [69.72194342962615]
We introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient?
First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch.
Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model.
Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time.
arXiv Detail & Related papers (2024-01-11T18:59:14Z)
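The second point of that recipe, applying LoRA to selected layers instead of fine-tuning the whole model, can be illustrated with a minimal sketch. The snippet below is a generic LoRA wrapper, not E$^{2}$GAN's actual implementation: a frozen linear layer is augmented with a trainable low-rank update scaled by alpha/rank, so only the two small low-rank matrices receive gradients.

```python
# Generic LoRA sketch (illustrative, not the paper's implementation):
# a frozen linear layer plus a trainable low-rank correction B @ A.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank trainable update.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T


layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(2, 512))  # only lora_a / lora_b receive gradients
```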
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers.
These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run.
We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z)
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms much larger state-of-the-art models that use 2-20x more FLOPs and bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
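The summary above does not spell out the fusion mechanism, so the following is only a generic sketch of what adaptive, iterative multi-modal fusion can look like: image and text tokens repeatedly cross-attend to each other and are then pooled into a shared compact embedding. The module layout, shapes, and number of refinement steps are illustrative assumptions rather than the paper's architecture.

```python
# Generic sketch of iterative cross-modal fusion (illustrative only).
import torch
import torch.nn as nn


class IterativeFusion(nn.Module):
    def __init__(self, dim: int = 256, num_steps: int = 3, heads: int = 4):
        super().__init__()
        self.num_steps = num_steps
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # joint compact representation

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_steps):  # iterative refinement of both modalities
            img_tokens = img_tokens + self.img_attends_txt(img_tokens, txt_tokens, txt_tokens)[0]
            txt_tokens = txt_tokens + self.txt_attends_img(txt_tokens, img_tokens, img_tokens)[0]
        pooled = torch.cat([img_tokens.mean(1), txt_tokens.mean(1)], dim=-1)
        return self.proj(pooled)  # shared embedding for downstream matching


fusion = IterativeFusion()
joint = fusion(torch.randn(2, 49, 256), torch.randn(2, 16, 256))  # -> (2, 256)
```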