LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
- URL: http://arxiv.org/abs/2501.12976v1
- Date: Wed, 22 Jan 2025 16:02:06 GMT
- Title: LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
- Authors: Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo,
- Abstract summary: Linear Diffusion Transformer (LiT) is an efficient text-to-image Transformer that can be deployed offline on a laptop.
On class-conditional ImageNet 256×256 and 512×512, LiT achieves highly competitive FID while reducing training steps by 80% and 77%, respectively, compared to DiT.
For text-to-image generation, LiT allows for the rapid synthesis of up to 1K resolution photorealistic images.
- Score: 96.54620463472526
- License:
- Abstract: Among commonly used sub-quadratic complexity modules, linear attention benefits from simplicity and high parallelism, making it promising for image synthesis tasks. However, the architectural design and learning strategy for linear attention remain underexplored in this field. In this paper, we offer a suite of ready-to-use solutions for efficient linear diffusion Transformers. Our core contributions include: (1) Simplified linear attention using few heads, which we observe to be a free lunch: performance is maintained with no increase in latency. (2) Weight inheritance from a fully pre-trained diffusion Transformer: the linear Transformer is initialized from a pre-trained diffusion Transformer by loading all parameters except those related to linear attention. (3) A hybrid knowledge distillation objective: a pre-trained diffusion Transformer helps train the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), an efficient text-to-image Transformer that can be deployed offline on a laptop. Experiments show that on the class-conditional 256×256 and 512×512 ImageNet benchmarks, LiT achieves highly competitive FID while reducing training steps by 80% and 77%, respectively, compared to DiT. LiT also rivals methods based on Mamba or Gated Linear Attention. Besides, for text-to-image generation, LiT allows for the rapid synthesis of photorealistic images at up to 1K resolution. Project page: https://techmonsterwang.github.io/LiT/.
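The three guidelines above are concrete enough to sketch in code. The snippet below is an illustrative sketch only, not the authors' released implementation: the names SimplifiedLinearAttention, inherit_weights, and hybrid_kd_loss, the ReLU feature map, the head count, the "attn" key filter, and the loss weights are all assumptions made for this example.

```python
# Hedged sketch of the three guidelines: (1) linear attention with few heads,
# (2) weight inheritance from a pre-trained DiT, (3) a hybrid distillation loss
# on predicted noise and variance. Names and hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedLinearAttention(nn.Module):
    """Linear attention whose cost is linear in the number of tokens."""
    def __init__(self, dim: int, num_heads: int = 2):   # few heads, per guideline (1)
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))                   # (B, heads, N, head_dim)
        q, k = F.relu(q), F.relu(k)                      # non-negative feature map (assumption)
        kv = k.transpose(-2, -1) @ v                     # (B, h, d, d): linear in N
        z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)
        out = (q @ kv) * z                               # normalized linear attention
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

def inherit_weights(student: nn.Module, teacher_state_dict: dict):
    """Guideline (2): copy every pre-trained DiT weight whose name and shape match;
    attention-related weights keep their fresh initialization."""
    own = student.state_dict()
    kept = {k: v for k, v in teacher_state_dict.items()
            if k in own and own[k].shape == v.shape and "attn" not in k}
    student.load_state_dict(kept, strict=False)

def hybrid_kd_loss(student_out, teacher_out, noise, w_noise=1.0, w_var=1.0):
    """Guideline (3): supervise the predicted noise with both the ground-truth noise
    and the teacher, and match the teacher's predicted variance.
    The exact weighting is an assumption of this sketch."""
    eps_s, var_s = student_out.chunk(2, dim=1)           # DiT-style two-part output
    eps_t, var_t = teacher_out.chunk(2, dim=1)
    loss = F.mse_loss(eps_s, noise)
    loss = loss + w_noise * F.mse_loss(eps_s, eps_t)
    loss = loss + w_var * F.mse_loss(var_s, var_t)
    return loss
```

The `"attn" not in k` filter mirrors the stated rule of loading every pre-trained parameter except those tied to linear attention; the exact key pattern in practice depends on the model definition.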
Related papers
- On Disentangled Training for Nonlinear Transform in Learned Image Compression [59.66885464492666]
Learned image compression (LIC) has demonstrated superior rate-distortion (R-D) performance compared to traditional codecs.
Existing LIC methods overlook the slow convergence caused by energy compaction when learning nonlinear transforms.
We propose a linear auxiliary transform (AuxT) to disentangle energy compaction in training nonlinear transforms.
arXiv Detail & Related papers (2025-01-23T15:32:06Z)
- MonoFormer: One Transformer for Both Diffusion and Autoregression [70.81047437281583]
We propose to study a simple idea: share one transformer for both autoregression and diffusion.
Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods.
arXiv Detail & Related papers (2024-09-24T17:51:04Z) - Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
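For reference, the sequential delta-rule recurrence underlying such linear transformers can be sketched as below; the paper's contribution is a hardware-efficient algorithm that parallelizes this computation over the sequence length, which this naive loop does not implement. Tensor shapes, names, and the learned write strength beta are illustrative assumptions.

```python
# Reference (sequential) delta-rule recurrence for a linear-attention fast-weight
# memory; the paper parallelizes this over the sequence dimension.
import torch

def delta_rule_attention(q, k, v, beta):
    """q, k: (B, T, d_k); v: (B, T, d_v); beta: (B, T) learned write strengths."""
    B, T, d_k = q.shape
    d_v = v.shape[-1]
    S = q.new_zeros(B, d_k, d_v)                             # fast-weight memory
    outputs = []
    for t in range(T):
        k_t, v_t, q_t = k[:, t], v[:, t], q[:, t]
        b_t = beta[:, t, None]                               # (B, 1)
        v_old = torch.einsum("bd,bdv->bv", k_t, S)           # value currently stored under k_t
        # Delta rule: move the stored value for k_t toward the new value v_t.
        S = S + torch.einsum("bd,bv->bdv", k_t, b_t * (v_t - v_old))
        outputs.append(torch.einsum("bd,bdv->bv", q_t, S))   # readout with the query
    return torch.stack(outputs, dim=1)                       # (B, T, d_v)
```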
arXiv Detail & Related papers (2024-06-10T17:24:42Z)
- ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars for Write Noise Mitigation [6.853523674099236]
In-memory computing (IMC) crossbars based on Non-volatile Memories (NVMs) have emerged as a promising solution for accelerating transformers.
We find pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the impact of dynamically generated write noise.
We propose a new memristive crossbar platform to boost the non-ideal accuracies of pre-trained ViT models.
arXiv Detail & Related papers (2024-02-04T19:04:37Z)
- Transformer as Linear Expansion of Learngene [38.16612771203953]
Linear Expansion of learnGene (TLEG) is a novel approach for flexibly producing and initializing Transformers of diverse depths.
Experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance in contrast to many individual models trained from scratch.
arXiv Detail & Related papers (2023-12-09T17:01:18Z)
- Linear attention is (maybe) all you need (to understand transformer optimization) [55.81555204646486]
We make progress towards understanding the subtleties of training Transformers by studying a simple yet canonical shallow Transformer model.
Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics.
arXiv Detail & Related papers (2023-10-02T10:48:42Z)
- Tangent Transformers for Composition, Privacy and Removal [58.280295030852194]
Tangent Attention Fine-Tuning (TAFT) is a method for fine-tuning linearized transformers.
arXiv Detail & Related papers (2023-07-16T18:31:25Z)
- Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization [31.28396970291575]
Efficient transformers, leveraging techniques such as sparse attention, linear attention, and hashing tricks, have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade accuracy.
We first interpret the linear attention and residual connections in computing the attention map as gradient descent steps.
We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities.
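One minimal way to picture this is heavy-ball momentum applied to the causal linear-attention state accumulation, as in the hedged sketch below; the paper's exact formulation (where momentum enters and how it is parameterized) may differ, and the function name, shapes, and hyper-parameters here are assumptions.

```python
# Hedged sketch: heavy-ball momentum on the causal linear-attention state update,
# following the gradient-descent interpretation of that accumulation.
import torch

def momentum_linear_attention(q, k, v, gamma=0.9, step=1.0):
    """q, k: (B, T, d) with a non-negative feature map already applied; v: (B, T, d_v)."""
    B, T, d = q.shape
    d_v = v.shape[-1]
    S = q.new_zeros(B, d, d_v)          # attention state (fast weights)
    M = torch.zeros_like(S)             # momentum buffer
    z = q.new_zeros(B, d)               # running normalizer
    outs = []
    for t in range(T):
        update = torch.einsum("bd,bv->bdv", k[:, t], v[:, t])   # plain linear-attention increment
        M = gamma * M + update                                   # heavy-ball momentum
        S = S + step * M
        z = z + k[:, t]
        num = torch.einsum("bd,bdv->bv", q[:, t], S)
        den = (q[:, t] * z).sum(-1, keepdim=True) + 1e-6
        outs.append(num / den)
    return torch.stack(outs, dim=1)                              # (B, T, d_v)
```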
arXiv Detail & Related papers (2022-08-01T02:37:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences arising from its use.