LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
- URL: http://arxiv.org/abs/2501.12976v1
- Date: Wed, 22 Jan 2025 16:02:06 GMT
- Title: LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
- Authors: Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo,
- Abstract summary: Linear Diffusion Transformer (LiT) is an efficient text-to-image Transformer that can be deployed offline on a laptop. LiT achieves highly competitive FID while reducing training steps by 80% and 77% compared to DiT. For text-to-image generation, LiT allows for the rapid synthesis of up to 1K resolution photorealistic images.
- Score: 96.54620463472526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In commonly used sub-quadratic complexity modules, linear attention benefits from simplicity and high parallelism, making it promising for image synthesis tasks. However, the architectural design and learning strategy for linear attention remain underexplored in this field. In this paper, we offer a suite of ready-to-use solutions for efficient linear diffusion Transformers. Our core contributions include: (1) Simplified Linear Attention using few heads, observing the free-lunch effect of performance without latency increase. (2) Weight inheritance from a fully pre-trained diffusion Transformer: initializing the linear Transformer from a pre-trained diffusion Transformer and loading all parameters except those related to linear attention. (3) Hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to help train the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), an efficient text-to-image Transformer that can be deployed offline on a laptop. Experiments show that on the class-conditional 256×256 and 512×512 ImageNet benchmarks, LiT achieves highly competitive FID while reducing training steps by 80% and 77% compared to DiT. LiT also rivals methods based on Mamba or Gated Linear Attention. Besides, for text-to-image generation, LiT allows for the rapid synthesis of up to 1K resolution photorealistic images. Project page: https://techmonsterwang.github.io/LiT/.
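The abstract's core building block, linear attention, replaces the softmax with a kernel feature map so the key-value product can be computed once and reused by every query, reducing complexity from quadratic to linear in sequence length. A minimal single-head sketch in NumPy (the ReLU-based feature map and toy shapes here are illustrative assumptions, not LiT's exact design):

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: out = phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1)."""
    # Feature map: ReLU plus a small constant keeps activations positive
    # so the normalizer is never zero (a common, illustrative choice).
    phi = lambda x: np.maximum(x, 0.0) + eps
    Qp, Kp = phi(Q), phi(K)          # (n, d) each
    kv = Kp.T @ V                    # (d, d_v): computed once, O(n) overall
    z = Qp @ Kp.sum(axis=0)          # (n,): per-query normalization
    return (Qp @ kv) / z[:, None]    # (n, d_v)

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

Because `kv` is a small `(d, d_v)` matrix shared by all queries, the cost grows linearly with `n` instead of quadratically, which is what makes laptop-scale deployment plausible.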
Related papers
- LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis. We study how to design compute-efficient linear attention within this framework. We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think [63.25744258438214]
REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. We propose Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising.
arXiv Detail & Related papers (2025-07-02T08:29:18Z) - MoiréXNet: Adaptive Multi-Scale Demoiréing with Linear Attention Test-Time Training and Truncated Flow Matching Prior [11.753823187605033]
This paper introduces a novel framework for image and video demoiréing by integrating Maximum A Posteriori (MAP) estimation with advanced deep learning techniques.
arXiv Detail & Related papers (2025-06-19T00:15:07Z) - REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training [58.33728862521732]
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy -- representation alignment (REPA) that matches DiT hidden features to those of a non-generative teacher (e.g. DINO) -- dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We introduce HASTE.
arXiv Detail & Related papers (2025-05-22T15:34:33Z) - EDiT: Efficient Diffusion Transformers with Linear Compressed Attention [11.36660486878447]
The quadratic scaling of attention in DiTs hinders image generation at higher resolutions or on devices with limited resources.
We introduce an efficient diffusion transformer (EDiT) to alleviate these efficiency bottlenecks.
We demonstrate the effectiveness of the EDiT and MM-EDiT architectures by integrating them into PixArt-Sigma (a conventional DiT) and Stable Diffusion 3.5-Medium (an MM-DiT).
arXiv Detail & Related papers (2025-03-20T21:58:45Z) - On Disentangled Training for Nonlinear Transform in Learned Image Compression [59.66885464492666]
Learned image compression (LIC) has demonstrated superior rate-distortion (R-D) performance compared to traditional codecs.
Existing LIC methods overlook the slow convergence caused by compacting energy in learning nonlinear transforms.
We propose a linear auxiliary transform (AuxT) to disentangle energy compaction in training nonlinear transforms.
arXiv Detail & Related papers (2025-01-23T15:32:06Z) - CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up [64.38715211969516]
We introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token. Experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity.
arXiv Detail & Related papers (2024-12-20T17:57:09Z) - MonoFormer: One Transformer for Both Diffusion and Autoregression [70.81047437281583]
We propose to study a simple idea: share one transformer for both autoregression and diffusion.
Experimental results show that our approach achieves comparable image generation performance to current state-of-the-art methods.
arXiv Detail & Related papers (2024-09-24T17:51:04Z) - Parallelizing Linear Transformers with the Delta Rule over Sequence Length [49.88826673324244]
This work describes a hardware-efficient algorithm for training linear transformers with the delta rule.
We train a 1.3B model for 100B tokens and find that it outperforms recent linear-time baselines.
arXiv Detail & Related papers (2024-06-10T17:24:42Z) - Transformer as Linear Expansion of Learngene [38.16612771203953]
Linear Expansion of learnGene (TLEG) is a novel approach for flexibly producing and initializing Transformers of diverse depths.
Experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance in contrast to many individual models trained from scratch.
arXiv Detail & Related papers (2023-12-09T17:01:18Z) - Linear attention is (maybe) all you need (to understand transformer optimization) [55.81555204646486]
We make progress towards understanding the subtleties of training Transformers by studying a simple yet canonical shallow Transformer model.
Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics.
arXiv Detail & Related papers (2023-10-02T10:48:42Z) - Tangent Transformers for Composition, Privacy and Removal [58.280295030852194]
Tangent Attention Fine-Tuning (TAFT) is a method for fine-tuning linearized transformers.
arXiv Detail & Related papers (2023-07-16T18:31:25Z) - Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization [31.28396970291575]
Efficient transformers leveraging techniques such as sparse attention, linear attention, and hashing tricks have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade accuracy.
We first interpret the linear attention and residual connections in computing the attention map as gradient descent steps.
We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities.
arXiv Detail & Related papers (2022-08-01T02:37:49Z)
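Several of the listed papers build on recurrent formulations of linear attention; the delta-rule entry above is the clearest example. A minimal sequential sketch of the delta-rule update (the fixed learning rate `beta`, toy shapes, and data are illustrative assumptions; the actual paper derives a hardware-efficient parallel form over sequence length):

```python
import numpy as np

def delta_rule_attention(Q, K, V, beta=0.5):
    """Sequential delta-rule linear transformer.

    The state S is a (d_v, d_k) fast-weight matrix updated per step:
        S_t = S_{t-1} + beta * (v_t - S_{t-1} k_t) k_t^T
    i.e. the value stored under key k_t is nudged toward v_t, and each
    output reads the state with the query: o_t = S_t q_t.
    """
    n, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_v, d_k))
    out = np.empty((n, d_v))
    for t in range(n):
        k, v, q = K[t], V[t], Q[t]
        S = S + beta * np.outer(v - S @ k, k)  # delta-rule correction
        out[t] = S @ q                          # read with the query
    return out

rng = np.random.default_rng(1)
n, d = 6, 3
Q, K, V = rng.normal(size=(3, n, d))
out = delta_rule_attention(Q, K, V)
print(out.shape)  # (6, 3)
```

With `beta=1` and a unit key, a single update stores the value exactly, which is the error-correcting behavior that distinguishes the delta rule from plain additive linear attention.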
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.