KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2312.04005v3
- Date: Thu, 21 Nov 2024 23:22:52 GMT
- Title: KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis
- Authors: Youngwan Lee, Kwanyong Park, Yoorhim Cho, Yong-Ju Lee, Sung Ju Hwang
- Abstract summary: We present three key practices in building an efficient text-to-image model.
Based on these findings, we build two types of efficient text-to-image models, called KOALA-Turbo and KOALA-Lightning.
Unlike SDXL, our KOALA models can generate 1024px high-resolution images on consumer-grade GPUs with 8GB of VRAM (3060Ti).
- Score: 52.42320594388199
- Abstract: As text-to-image (T2I) synthesis models increase in size, they demand higher inference costs due to the need for more expensive GPUs with larger memory, which makes it challenging to reproduce these models in addition to the restricted access to training datasets. Our study aims to reduce these inference costs and explores how far the generative capabilities of T2I models can be extended using only publicly available datasets and open-source models. To this end, by using the de facto standard text-to-image model, Stable Diffusion XL (SDXL), we present three key practices in building an efficient T2I model: (1) Knowledge distillation: we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and find that self-attention is the most crucial part. (2) Data: despite fewer samples, high-resolution images with rich captions are more crucial than a larger number of low-resolution images with short captions. (3) Teacher: Step-distilled Teacher allows T2I models to reduce the noising steps. Based on these findings, we build two types of efficient text-to-image models, called KOALA-Turbo &-Lightning, with two compact U-Nets (1B & 700M), reducing the model size up to 54% and 69% of the SDXL U-Net. In particular, the KOALA-Lightning-700M is 4x faster than SDXL while still maintaining satisfactory generation quality. Moreover, unlike SDXL, our KOALA models can generate 1024px high-resolution images on consumer-grade GPUs with 8GB of VRAMs (3060Ti). We believe that our KOALA models will have a significant practical impact, serving as cost-effective alternatives to SDXL for academic researchers and general users in resource-constrained environments.
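The abstract's first practice, distilling SDXL into a smaller U-Net with emphasis on self-attention features, can be illustrated with a minimal sketch. This is not the authors' implementation: the MSE-based feature matching, the function names (`mse`, `distillation_loss`), and the weights `w_out` and `w_feat` are all assumptions made for illustration, and features are represented as plain Python lists rather than tensors.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("feature shapes must match")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_out: Vector,
    teacher_out: Vector,
    student_feats: Sequence[Vector],
    teacher_feats: Sequence[Vector],
    target_noise: Vector,
    w_out: float = 1.0,   # hypothetical weight for output-level distillation
    w_feat: float = 1.0,  # hypothetical weight for feature-level distillation
) -> float:
    """Combine a denoising task loss with output- and feature-level distillation.

    student_feats / teacher_feats are assumed to hold matched self-attention
    features, one pair per distilled location in the U-Net.
    """
    task = mse(student_out, target_noise)       # standard noise-prediction loss
    out_kd = mse(student_out, teacher_out)      # match the teacher's prediction
    feat_kd = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return task + w_out * out_kd + w_feat * feat_kd


# A student that reproduces the teacher and the target exactly has zero loss.
perfect = distillation_loss([1.0, 2.0], [1.0, 2.0], [[0.0, 0.0]], [[0.0, 0.0]], [1.0, 2.0])
# A mismatched student is penalized at both the output and the feature level.
mismatch = distillation_loss([0.0, 0.0], [1.0, 1.0], [[0.0, 0.0]], [[2.0, 2.0]], [0.0, 0.0])
```

In a real training loop the same structure would operate on tensors from the teacher and student U-Nets; which feature locations to match is the design choice the abstract says matters most.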
Related papers
- Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion [34.70370851239368]
We show that pixel-space models can in fact be very competitive with latent approaches in both quality and efficiency.
We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions.
arXiv Detail & Related papers (2024-10-25T06:20:06Z)
- ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization [59.72782742378666]
We propose Reward-based Noise Optimization (ReNO) to enhance Text-to-Image models at inference.
Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models.
arXiv Detail & Related papers (2024-06-06T17:56:40Z)
- SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions [5.100085108873068]
We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately 100 FPS (30x faster than SD v1.5) and 30 FPS (60x faster than SDXL) on a single GPU.
Our training approach offers promising applications in image-conditioned control, facilitating efficient image-to-image translation.
arXiv Detail & Related papers (2024-03-25T11:16:23Z)
- Diffusion Model Compression for Image-to-Image Translation [25.46012859377184]
We propose a novel compression method tailored for diffusion-based I2I models.
Based on the observations that the image conditions of I2I models already provide rich information on image structures, we develop surprisingly simple yet effective approaches for reducing the model size and latency.
Our approach achieves satisfactory output quality with 39.2%, 56.4% and 39.2% reduction in model footprint, as well as 81.4%, 68.7% and 31.1% decrease in latency for InstructPix2Pix, StableSR and ControlNet, respectively.
arXiv Detail & Related papers (2024-01-31T02:25:52Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundant blocks of the model while maintaining the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images [19.62509002853736]
We assemble a dataset of Creative-Commons-licensed (CC) images to train text-to-image generative models.
We use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with curated CC images.
We develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION-2B data needed to train existing SD2 models, but obtains comparable quality.
arXiv Detail & Related papers (2023-10-25T17:56:07Z)
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers.
These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run.
We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z)
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms state-of-the-art models that are many times larger, use 2-20x more FLOPs, and are trained on bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion [3.1092085121563526]
Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high computing demands.
Recent studies have reduced sampling steps and applied network quantization while retaining the original architectures.
We uncover the surprising potential of block pruning and feature distillation for low-cost general-purpose T2I.
arXiv Detail & Related papers (2023-05-25T07:28:28Z)
- HoloDiffusion: Training a 3D Diffusion Model using 2D Images [71.1144397510333]
We introduce a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision.
We show that our diffusion models are scalable, train robustly, and are competitive in terms of sample quality and fidelity to existing approaches for 3D generative modeling.
arXiv Detail & Related papers (2023-03-29T07:35:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.