U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
- URL: http://arxiv.org/abs/2405.02730v3
- Date: Wed, 30 Oct 2024 16:13:42 GMT
- Title: U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers
- Authors: Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang,
- Abstract summary: We propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models.
The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its computation cost.
- Score: 28.936553798624136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; but meanwhile, the abandonment of U-Net by DiTs and their following improvements is worth rethinking. To this end, we conduct a simple toy experiment by comparing a DiT with a U-Net architecture against an isotropic one. It turns out that the U-Net architecture gains only a slight advantage despite the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in the paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT could outperform DiT-XL/2 with only 1/6 of its computation cost. Codes are available at https://github.com/YuchuanTian/U-DiT.
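The token-downsampling step is the core mechanism of the abstract above. Below is a minimal sketch of the idea, assuming a hypothetical module with illustrative names (`dim`, `num_heads`, and `down_factor` are assumptions, not the authors' configuration): latent tokens are pooled to a coarser grid before forming the query-key-value tuple, attention runs on the shorter sequence, and the result is upsampled back to the original token resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    """Illustrative sketch: run self-attention on a spatially downsampled
    token grid, then upsample the result back. Names and structure are
    assumptions for the sketch, not the U-DiT implementation."""

    def __init__(self, dim: int, num_heads: int = 8, down_factor: int = 2):
        super().__init__()
        self.down_factor = down_factor
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w latent tokens
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        # Downsample tokens before building the query-key-value tuple
        small = F.avg_pool2d(grid, self.down_factor)
        hs, ws = small.shape[-2:]
        tokens = small.flatten(2).transpose(1, 2)      # (B, N / df^2, C)
        out, _ = self.attn(tokens, tokens, tokens)     # attention on fewer tokens
        # Upsample back to the original token resolution
        out = out.transpose(1, 2).reshape(b, c, hs, ws)
        out = F.interpolate(out, size=(h, w), mode="nearest")
        out = out.flatten(2).transpose(1, 2)
        return x + self.proj(out)                      # residual connection

# Usage: a 16x16 latent grid of 256 tokens with 384 channels
x = torch.randn(2, 256, 384)
y = DownsampledSelfAttention(384)(x, h=16, w=16)
print(y.shape)  # torch.Size([2, 256, 384])
```

Because attention cost scales quadratically with token count, a downsampling factor of 2 cuts the attention-matrix FLOPs by roughly 16x in this toy setup, which matches the intuition that low-frequency-dominated features tolerate a coarser token grid.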
Related papers
- PiT: Progressive Diffusion Transformer [50.46345527963736]
We propose a series of Pseudo Progressive Diffusion Transformer (PiT) models.
Our proposed PiT-L achieves a 54% FID improvement over DiT-XL/2 while using less computation.
arXiv Detail & Related papers (2025-05-19T15:02:33Z) - U-REPA: Aligning Diffusion U-Nets to ViTs [29.987838381433445]
We propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features.
Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence.
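As a hedged illustration of what such an alignment objective could look like (the projection head, feature dimensions, and cosine loss below are assumptions for the sketch, not U-REPA's exact formulation), a U-Net hidden state is projected into the ViT feature space and penalized for dissimilarity alongside the usual diffusion loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Illustrative alignment loss: project U-Net hidden states into the
    ViT feature space and maximize token-wise cosine similarity.
    Dimensions and the projector are assumptions for this sketch."""

    def __init__(self, unet_dim: int, vit_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(unet_dim, vit_dim), nn.SiLU(), nn.Linear(vit_dim, vit_dim)
        )

    def forward(self, unet_feats: torch.Tensor, vit_feats: torch.Tensor) -> torch.Tensor:
        # unet_feats: (B, N, unet_dim); vit_feats: (B, N, vit_dim) from a frozen ViT
        pred = F.normalize(self.proj(unet_feats), dim=-1)
        target = F.normalize(vit_feats, dim=-1)
        return 1.0 - (pred * target).sum(dim=-1).mean()   # mean cosine distance

# total_loss = diffusion_loss + lambda_align * AlignmentHead(320, 768)(h_unet, h_vit)
```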
arXiv Detail & Related papers (2025-03-24T07:46:00Z) - Deep Tree Tensor Networks for Image Recognition [1.8434042562191815]
This paper introduces a novel architecture named Deep Tree Tensor Network (DTTN).
DTTN captures $2^L$-order multiplicative interactions across features through multilinear operations.
We theoretically reveal the equivalency among quantum-inspired TN models and interacting networks under certain conditions.
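To make the order-doubling intuition concrete, here is a toy sketch (not the DTTN architecture itself; the block below is an illustrative stand-in): each block multiplies two linear projections of its input elementwise, so stacking $L$ such blocks composes features that are products of up to $2^L$ input-linear terms.

```python
import torch
import torch.nn as nn

class MultiplicativeBlock(nn.Module):
    """Toy block: elementwise product of two linear maps.
    Stacking L such blocks yields up to 2^L-order feature interactions."""
    def __init__(self, dim: int):
        super().__init__()
        self.a = nn.Linear(dim, dim)
        self.b = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.a(x) * self.b(x)   # interaction order doubles at every layer

# Three blocks -> up to 2^3 = 8th-order interactions of the input features
net = nn.Sequential(*[MultiplicativeBlock(64) for _ in range(3)], nn.Linear(64, 10))
logits = net(torch.randn(4, 64))
print(logits.shape)  # torch.Size([4, 10])
```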
arXiv Detail & Related papers (2025-02-14T05:41:33Z) - Accelerating Vision Diffusion Transformers with Skip Branches [46.19946204953147]
Diffusion Transformers (DiT) are an emerging image and video generation model architecture.
DiT's practical deployment is constrained by computational complexity and redundancy in the sequential denoising process.
We introduce Skip-DiT, which converts standard DiT into Skip-DiT with skip branches to enhance feature smoothness.
We also introduce Skip-Cache which utilizes the skip branches to cache DiT features across timesteps at the inference time.
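A rough sketch of the caching idea, under an assumed block structure, cache interval, and fusion step (none of which are taken from the Skip-DiT code): deep features are recomputed only every few denoising steps and otherwise reused, while fresh shallow features are fused back in through a skip branch.

```python
import torch
import torch.nn as nn

class ToySkipDiT(nn.Module):
    """Toy denoiser with one shallow block, one deep block, and a skip branch
    that fuses the shallow feature with a (possibly cached) deep feature."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.shallow = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.deep = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)   # skip-branch fusion

    def forward(self, x, deep_cache=None, refresh=True):
        s = self.shallow(x)
        d = self.deep(s) if refresh or deep_cache is None else deep_cache
        return self.fuse(torch.cat([s, d], dim=-1)), d

model = ToySkipDiT()
x = torch.randn(1, 64, 256)
cache, interval = None, 4
for t in range(20):                       # toy reverse-diffusion loop
    x, cache = model(x, cache, refresh=(t % interval == 0))
```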
arXiv Detail & Related papers (2024-11-26T17:28:10Z) - Shrinking the Giant : Quasi-Weightless Transformers for Low Energy Inference [0.30104001512119216]
Building models with fast and energy-efficient inference is imperative to enable a variety of transformer-based applications.
We build on an approach for learning LUT networks directly via an Extended Finite Difference method.
This allows for a computational and energy-efficient inference solution for transformer-based models.
arXiv Detail & Related papers (2024-11-04T05:38:56Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73x, and achieves a competitive FID score of 2.07 on ImageNet.
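A hedged sketch of timestep- and token-conditional computation (the router design below is an illustrative assumption, not DyDiT's mechanism): a gate conditioned on the timestep embedding decides which tokens pass through the expensive MLP, while the remaining tokens bypass it unchanged.

```python
import torch
import torch.nn as nn

class TokenRoutedMLP(nn.Module):
    """Illustrative token router: a timestep-conditioned gate selects which
    tokens run through the MLP; the rest bypass it. Not the DyDiT design."""
    def __init__(self, dim: int, t_dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim + t_dim, 1)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens; t_emb: (B, t_dim) timestep embedding
        b, n, c = x.shape
        scores = self.score(torch.cat([x, t_emb[:, None].expand(b, n, -1)], -1)).squeeze(-1)
        k = max(1, int(self.keep_ratio * n))
        idx = scores.topk(k, dim=1).indices               # tokens that get full compute
        out = x.clone()
        picked = torch.gather(x, 1, idx[..., None].expand(-1, -1, c))
        out.scatter_(1, idx[..., None].expand(-1, -1, c), picked + self.mlp(picked))
        return out

y = TokenRoutedMLP(384, 128)(torch.randn(2, 256, 384), torch.randn(2, 128))
print(y.shape)  # torch.Size([2, 256, 384])
```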
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window.
We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
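For illustration, a self-contained toy cross-encoder with only two transformer layers (the vocabulary size, dimensions, and pooling below are stand-ins, not a specific model from the paper): the query and document are concatenated and scored in a single joint forward pass, so the layer count directly sets the latency budget.

```python
import torch
import torch.nn as nn

class ShallowCrossEncoder(nn.Module):
    """Toy 2-layer cross-encoder: query and document tokens are concatenated
    and scored jointly. Vocabulary and tokenization are stand-ins."""
    def __init__(self, vocab: int = 30522, dim: int = 128, layers: int = 2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)  # few layers = low latency
        self.score = nn.Linear(dim, 1)

    def forward(self, query_ids: torch.Tensor, doc_ids: torch.Tensor) -> torch.Tensor:
        x = self.emb(torch.cat([query_ids, doc_ids], dim=1))  # joint query-document encoding
        return self.score(self.encoder(x).mean(dim=1)).squeeze(-1)

model = ShallowCrossEncoder()
q = torch.randint(0, 30522, (1, 8))      # toy query token ids
d = torch.randint(0, 30522, (1, 64))     # toy document token ids
print(model(q, d))                       # relevance score
```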
arXiv Detail & Related papers (2024-03-29T15:07:21Z) - VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z) - EIT: Efficiently Lead Inductive Biases to ViT [17.66805405320505]
Vision Transformer (ViT) depends on properties similar to the inductive bias inherent in Convolutional Neural Networks.
We propose an architecture called Efficiently lead Inductive biases to ViT (EIT), which can effectively lead the inductive biases to both phases of ViT.
On four popular small-scale datasets, compared with ViT, EIT achieves an accuracy improvement of 12.6% on average with fewer parameters and FLOPs.
arXiv Detail & Related papers (2022-03-14T14:01:17Z) - FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers.
We are the first to achieve comparable accuracy, with only about 1% degradation, on fully quantized Vision Transformers.
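For context, a generic post-training fake-quantization helper (a minimal sketch; FQ-ViT's actual contributions additionally quantize LayerNorm and Softmax with integer-friendly schemes, which this snippet does not reproduce):

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Generic symmetric per-tensor fake quantization: quantize to num_bits
    integers with a min-max scale, then dequantize to measure the error."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

w = torch.randn(768, 768)            # e.g. a ViT linear weight
w_q = fake_quantize(w)
print((w - w_q).abs().max())         # worst-case rounding error
```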
arXiv Detail & Related papers (2021-11-27T06:20:53Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
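A hedged sketch of a Hessian-aware saliency proxy (an empirical-Fisher approximation of the Hessian diagonal with illustrative normalization; not the paper's exact criterion): per-channel importance is estimated from squared gradient-weight products and normalized so that scores are comparable across layers.

```python
import torch
import torch.nn as nn

def channel_saliency(layer: nn.Linear, loss: torch.Tensor) -> torch.Tensor:
    """Illustrative Hessian-aware proxy: per-output-channel importance from a
    squared gradient-weight product (empirical-Fisher approximation of the
    Hessian diagonal), normalized for cross-layer comparison."""
    grad = torch.autograd.grad(loss, layer.weight, retain_graph=True)[0]
    score = (grad * layer.weight).pow(2).sum(dim=1)   # one score per output channel
    return score / score.sum()

layer = nn.Linear(64, 32)
x = torch.randn(8, 64)
loss = layer(x).pow(2).mean()                         # stand-in training loss
print(channel_saliency(layer, loss).shape)            # torch.Size([32])
```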
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.