SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
- URL: http://arxiv.org/abs/2410.10629v3
- Date: Sun, 20 Oct 2024 14:35:31 GMT
- Title: SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
- Authors: Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han
- Abstract summary: Sana is a text-to-image framework that can generate images up to 4096$\times$4096 resolution.
Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU.
- Score: 41.79064227895747
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096$\times$4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, and is deployable on a laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8$\times$, we trained an AE that can compress images 32$\times$, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with a modern small decoder-only LLM as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024$\times$1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.
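To make the token-count claim in design (1) concrete: an $f\times$ downsampling autoencoder maps an $H\times W$ image to an $(H/f)\times(W/f)$ latent grid, so moving from 8$\times$ to 32$\times$ compression cuts the number of latent tokens by 16$\times$. A minimal sketch of this arithmetic (the patch size of 1 is an assumption, not stated in the abstract):
```python
# Token-count arithmetic implied by the abstract; patch size 1 is an assumption.
def num_latent_tokens(height: int, width: int, ae_factor: int, patch: int = 1) -> int:
    """Number of tokens the DiT sees after an ae_factor-x downsampling autoencoder."""
    return (height // (ae_factor * patch)) * (width // (ae_factor * patch))

print(num_latent_tokens(4096, 4096, ae_factor=8))   # 262144 tokens (traditional 8x AE)
print(num_latent_tokens(4096, 4096, ae_factor=32))  # 16384 tokens (32x AE, 16x fewer)
```
Design (2) replaces softmax attention with linear attention, whose cost grows linearly rather than quadratically in the token count N (roughly O(N d^2) instead of O(N^2 d) for head dimension d). Below is a minimal PyTorch sketch of kernel-based linear attention with a ReLU feature map; it illustrates the general technique rather than SANA's exact module, and the tensor shapes and function name are assumptions:
```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernel-based linear attention: softmax(Q K^T) V is replaced by
    phi(Q) (phi(K)^T V) with phi = ReLU, computed in O(N * d^2) time.
    q, k, v: (batch, heads, tokens, dim)."""
    q, k = F.relu(q), F.relu(k)                 # feature map phi
    kv = torch.einsum("bhnd,bhne->bhde", k, v)  # sum_n phi(k_n) v_n^T, shape (b, h, d, d)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)  # row normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# At 4096x4096 with a 32x AE, N = 16384 tokens; the key-value summary `kv`
# stays (dim x dim) regardless of N, unlike an N x N softmax attention matrix.
x = torch.randn(1, 8, 16384, 32)
print(linear_attention(x, x, x).shape)  # torch.Size([1, 8, 16384, 32])
```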
Related papers
- CMC-Bench: Towards a New Paradigm of Visual Signal Compression [85.1839779884282]
We introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression.
At ultra-low bitrates, this paper proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal protocols.
arXiv Detail & Related papers (2024-06-13T17:41:37Z)
- TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose a fine-tuning approach, TextCraftor, to enhance the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z)
- PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
It supports high-resolution image synthesis up to 1024px resolution with low training cost.
Tests demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control.
arXiv Detail & Related papers (2023-09-30T16:18:00Z)
- Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images [56.17404812357676]
Stable Diffusion, a generative model used in text-to-image synthesis, frequently encounters composition problems when generating images of varying sizes.
We propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size.
We show that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
arXiv Detail & Related papers (2023-08-31T09:27:56Z)
- Extreme Generative Image Compression by Learning Text Embedding from Diffusion Models [13.894251782142584]
We propose a generative image compression method that demonstrates the potential of saving an image as a short text embedding.
Our method outperforms other state-of-the-art deep learning methods in terms of both perceptual quality and diversity.
arXiv Detail & Related papers (2022-11-14T22:54:19Z)
- PILC: Practical Image Lossless Compression with an End-to-end GPU Oriented Neural Framework [88.18310777246735]
We propose an end-to-end image compression framework that achieves 200 MB/s for both compression and decompression with a single NVIDIA Tesla V100 GPU.
Experiments show that our framework compresses better than PNG by a margin of 30% on multiple datasets.
arXiv Detail & Related papers (2022-06-10T03:00:10Z)
- One-shot Ultra-high-Resolution Generative Adversarial Network That Synthesizes 16K Images On A Single GPU [1.9060575156739825]
OUR-GAN is a one-shot generative adversarial network framework that generates non-repetitive 16K images from a single training image.
OUR-GAN can synthesize high-quality 16K images with 12.5 GB of GPU memory and 4K images with only 4.29 GB.
OUR-GAN is the first one-shot image synthesizer that generates non-repetitive UHR images on a single consumer GPU.
arXiv Detail & Related papers (2022-02-28T13:48:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.