LinFusion: 1 GPU, 1 Minute, 16K Image
- URL: http://arxiv.org/abs/2409.02097v3
- Date: Thu, 17 Oct 2024 08:09:37 GMT
- Title: LinFusion: 1 GPU, 1 Minute, 16K Image
- Authors: Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang,
- Abstract summary: We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
- Score: 71.44735417472043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, Gated Linear Attention, etc, and identify two key features--attention normalization and non-causal inference--that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-resolution images like 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, DistriFusion, etc, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.
Related papers
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.
Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z) - Latent-based Diffusion Model for Long-tailed Recognition [10.410057703866899]
Long-tailed imbalance distribution is a common issue in practical computer vision applications.
We propose a new approach, the Latent-based Diffusion Model for Long-tailed Recognition (LDMLR) as a feature augmentation method to tackle the issue.
The model's accuracy shows an improvement on the CIFAR-LT and ImageNet-LT datasets by using the proposed method.
arXiv Detail & Related papers (2024-04-06T06:15:07Z) - A-SDM: Accelerating Stable Diffusion through Redundancy Removal and
Performance Optimization [54.113083217869516]
In this work, we first explore the computational redundancy part of the network.
We then prune the redundancy blocks of the model and maintain the network performance.
Thirdly, we propose a global-regional interactive (GRI) attention to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z) - HartleyMHA: Self-Attention in Frequency Domain for Resolution-Robust and
Parameter-Efficient 3D Image Segmentation [4.48473804240016]
We introduce the HartleyMHA model which is robust to training image resolution with efficient self-attention.
We modify the FNO by using the Hartley transform with shared parameters to reduce the model size by orders of magnitude.
When tested on the BraTS'19 dataset, it achieved superior robustness to training image resolution than other tested models with less than 1% of their model parameters.
arXiv Detail & Related papers (2023-10-05T18:44:41Z) - Breaking Through the Haze: An Advanced Non-Homogeneous Dehazing Method
based on Fast Fourier Convolution and ConvNeXt [14.917290578644424]
Haze usually leads to deteriorated images with low contrast, color shift and structural distortion.
We propose a novel two branch network that leverages 2D discrete wavelete transform (DWT), fast Fourier convolution (FFC) residual block and a pretrained ConvNeXt model.
Our model is able to effectively explore global contextual information and produce images with better perceptual quality.
arXiv Detail & Related papers (2023-05-08T02:59:02Z) - CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed as CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z) - Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to Transformer to address this challenge.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 times 128$ and FFHQ $256 times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z) - Collaborative Distillation for Ultra-Resolution Universal Style Transfer [71.18194557949634]
We present a new knowledge distillation method (named Collaborative Distillation) for encoder-decoder based neural style transfer.
We achieve ultra-resolution (over 40 megapixels) universal style transfer on a 12GB GPU for the first time.
arXiv Detail & Related papers (2020-03-18T18:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.