PiT: Progressive Diffusion Transformer
- URL: http://arxiv.org/abs/2505.13219v3
- Date: Wed, 06 Aug 2025 11:28:19 GMT
- Title: PiT: Progressive Diffusion Transformer
- Authors: Jiafu Wu, Yabiao Wang, Jian Li, Jinlong Peng, Yun Cao, Chengjie Wang, Jiangning Zhang,
- Abstract summary: We propose a series of Pseudo \textbf{P}rogressive D\textbf{i}ffusion \textbf{T}ransformers (\textbf{PiT}). Our proposed PiT-L achieves a 54\%$\uparrow$ FID improvement over DiT-XL/2 while using less computation.
- Score: 50.46345527963736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiTs) achieve remarkable performance in image generation via the transformer architecture. Conventionally, DiTs are constructed by stacking serial isotropic global-modeling transformer blocks, which incur significant quadratic computational cost. However, through empirical analysis, we find that DiTs do not rely as heavily on global information as previously believed; in fact, most layers exhibit significant redundancy in global computation. Additionally, conventional attention mechanisms suffer from low-frequency inertia, limiting their efficiency. To address these issues, we propose Pseudo Shifted Window Attention (PSWA), which fundamentally mitigates global attention redundancy. PSWA achieves a moderate balance between global and local information through window attention, and further utilizes a high-frequency bridging branch to simulate shifted window operations, which both enriches high-frequency information and strengthens inter-window connections. Furthermore, we propose the Progressive Coverage Channel Allocation (PCCA) strategy, which captures high-order attention without additional computational cost. Based on these innovations, we propose a series of Pseudo \textbf{P}rogressive D\textbf{i}ffusion \textbf{T}ransformers (\textbf{PiT}). Our extensive experiments show their superior performance; for example, our proposed PiT-L achieves a 54\%$\uparrow$ FID improvement over DiT-XL/2 while using less computation.
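The abstract gives no implementation details, so the following is only a rough sketch of the PSWA idea: window-local attention plus a convolutional branch standing in for the high-frequency bridge. The depthwise 3x3 bridge, the module name, and the shapes are our assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class PSWASketch(nn.Module):
    """Window attention plus a depthwise-conv "bridging" branch.

    The conv branch is our stand-in for the paper's high-frequency bridging
    branch: it injects high-frequency detail and links neighboring windows
    without an explicit shifted-window pass.
    """
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bridge = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise 3x3

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by window
        B, C, H, W = x.shape
        w = self.window
        # Partition the feature map into non-overlapping w x w windows of tokens.
        t = x.view(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, w * w, C)            # (B * num_windows, w*w, C)
        t, _ = self.attn(t, t, t)              # local self-attention inside each window
        # Merge windows back into a (B, C, H, W) feature map.
        t = t.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return t + self.bridge(x)              # conv branch carries cross-window, high-freq info
```

A call like `PSWASketch(dim=64)(torch.randn(1, 64, 32, 32))` exercises the block; `dim` must divide evenly across `heads`, and H, W across `window`.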
Related papers
- Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions [94.21989689001848]
We propose $\Delta$ConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks ($\Delta$ConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, $\Delta$ConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929$\times$ and surpassing LinFusion by 5.42$\times$ in efficiency, all without compromising generative fidelity.
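Read charitably, the distillation step could look like the hypothetical sketch below: the attention module acts as a frozen teacher and a convolutional block is regressed onto its outputs (the names and the MSE objective are our assumptions).

```python
import torch
import torch.nn.functional as F

def distill_attention_into_conv(attn_module, conv_block, x, optimizer):
    """One hypothetical distillation step: freeze the attention teacher and
    train the convolutional student to reproduce its output on the same input."""
    with torch.no_grad():
        target = attn_module(x)               # teacher output, no gradients
    loss = F.mse_loss(conv_block(x), target)  # student mimics the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```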
arXiv Detail & Related papers (2025-04-30T03:57:28Z)
- Contrast: A Hybrid Architecture of Transformers and State Space Models for Low-Level Vision [3.574664325523221]
We propose \textbf{Contrast}, a hybrid SR model that combines \textbf{Convolutional}, \textbf{Transformer}, and \textbf{State Space} components. By integrating transformer and state space mechanisms, \textbf{Contrast} compensates for the shortcomings of each approach, enhancing both global context modeling and pixel-level accuracy.
arXiv Detail & Related papers (2025-01-23T03:34:14Z)
- Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task [42.422925759342874]
We propose the Proxy-Tokenized Diffusion Transformer (PT-DiT) to model global visual information efficiently.
Within each transformer block, we compute an average token from each spatial-temporal window to serve as a proxy token for that region.
We also introduce window attention and shifted-window attention to address the limitations in detail modeling caused by the sparse attention mechanism.
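A minimal sketch of the averaging step for the spatial case (the paper also handles the temporal axis; the shapes and function name here are our assumptions):

```python
import torch

def proxy_tokens(x, window):
    """Average the tokens inside each non-overlapping window to produce one
    proxy token per region. x: (B, H, W, C), H and W divisible by `window`."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.mean(dim=(2, 4)).reshape(B, -1, C)   # (B, num_windows, C)

# A 32x32 map with 8x8 windows yields 16 proxy tokens per sample.
proxies = proxy_tokens(torch.randn(2, 32, 32, 64), window=8)
assert proxies.shape == (2, 16, 64)
```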
arXiv Detail & Related papers (2024-09-06T03:13:45Z)
- HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution [70.52256118833583]
We present a strategy to convert transformer-based SR networks into hierarchical transformers (HiT-SR).
Specifically, we first replace the commonly used fixed small windows with expanding hierarchical windows to aggregate features at different scales.
Considering the intensive computation required for large windows, we further design a spatial-channel correlation method whose complexity is linear in window size.
arXiv Detail & Related papers (2024-07-08T12:42:10Z)
- FORA: Fast-Forward Caching in Diffusion Transformer Acceleration [39.51519525071639]
Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos.
Fast-FORward CAching (FORA) is designed to accelerate DiT by exploiting the repetitive nature of the diffusion process.
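As we read the abstract, the mechanism amounts to recomputing layer outputs only every few denoising steps and reusing cached values in between; the wrapper below is a hypothetical sketch, not FORA's actual code.

```python
import torch.nn as nn

class CachedLayer(nn.Module):
    """Reuse a layer's output across nearby diffusion timesteps. `interval`
    is a hypothetical knob: the layer is recomputed every `interval` steps,
    and the cached output is returned in between. Assumes the denoising loop
    calls forward once per step, in order."""
    def __init__(self, layer, interval=3):
        super().__init__()
        self.layer = layer
        self.interval = interval
        self.cache = None

    def forward(self, x, step):
        if self.cache is None or step % self.interval == 0:
            self.cache = self.layer(x)        # refresh the cache on schedule
        return self.cache
```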
arXiv Detail & Related papers (2024-07-01T16:14:37Z)
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution [32.29219284419944]
We propose a cross-refinement adaptive feature modulation transformer (CRAFT).
We introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency.
Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods.
arXiv Detail & Related papers (2023-08-09T15:38:36Z)
- SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations [75.71298846760303]
We show that a single attention layer can achieve surprisingly competitive performance across node property prediction benchmarks.
We frame the proposed scheme as Simplified Graph Transformers (SGFormer), which is empowered by a simple attention model.
We believe the proposed methodology alone enlightens a new technical path of independent interest for building Transformers on large graphs.
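The one-layer claim is easy to picture as a single all-pairs attention pass over node features; the block below is only an illustrative stand-in (SGFormer's actual attention is a simpler, scalability-oriented variant), and all names are ours.

```python
import torch
import torch.nn as nn

class OneLayerNodeAttention(nn.Module):
    """A single global attention layer over node features: every node attends
    to every other node exactly once, with no deep stack."""
    def __init__(self, dim, heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, node_feats):            # (num_nodes, dim)
        x = node_feats.unsqueeze(0)           # treat the whole graph as one sequence
        out, _ = self.attn(x, x, x)
        return out.squeeze(0)
```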
arXiv Detail & Related papers (2023-06-19T08:03:25Z)
- RFR-WWANet: Weighted Window Attention-Based Recovery Feature Resolution Network for Unsupervised Image Registration
The Swin transformer has attracted attention in medical image analysis due to its computational efficiency and long-range modeling capability.
Transformer-based registration models combine multiple voxels into a single semantic token.
This merging process restricts the transformers to modeling and generating only coarse-grained spatial information.
We propose the Recovery Feature Resolution Network (RFRNet), which allows the transformer to contribute fine-grained spatial information.
arXiv Detail & Related papers (2023-05-07T09:57:29Z)
- Inception Transformer [151.939077819196]
Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
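One way to picture the high/low-frequency split: route part of the channels through convolution (high frequency, local detail) and the rest through attention (low frequency, global context), then concatenate. iFormer's actual mixers differ in detail (it also uses pooling paths), so treat this as a sketch.

```python
import torch
import torch.nn as nn

class InceptionMixerSketch(nn.Module):
    """Channel-split mixer: a convolutional high-frequency path alongside an
    attention low-frequency path, concatenated at the end."""
    def __init__(self, dim, high_dim, heads=4):
        super().__init__()
        assert (dim - high_dim) % heads == 0  # attention width must split across heads
        self.high_dim = high_dim
        self.high = nn.Conv2d(high_dim, high_dim, 3, padding=1)
        self.low = nn.MultiheadAttention(dim - high_dim, heads, batch_first=True)

    def forward(self, x):                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        hi, lo = x[:, :self.high_dim], x[:, self.high_dim:]
        hi = self.high(hi)                    # conv preserves local, high-frequency structure
        t = lo.flatten(2).transpose(1, 2)     # (B, H*W, C_low) token sequence
        t, _ = self.low(t, t, t)              # global attention mixes low frequencies
        lo = t.transpose(1, 2).reshape(B, C - self.high_dim, H, W)
        return torch.cat([hi, lo], dim=1)
```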
arXiv Detail & Related papers (2022-05-25T17:59:54Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
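The underlying trick is classical: a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the Toeplitz matrix in a circulant one and multiplying in Fourier space. A minimal sketch of just that step (not the paper's full kernelized attention):

```python
import torch

def toeplitz_matvec(c, x):
    """Compute T @ x in O(n log n), where T is the n x n Toeplitz matrix with
    T[i, j] = c[(i - j) + n - 1]; c holds the 2n-1 diagonal values for offsets
    -(n-1), ..., n-1 (e.g., relative-position coefficients)."""
    n = x.shape[0]
    # First column of the (2n-1)-sized circulant embedding:
    # [c_0, c_1, ..., c_{n-1}, c_{-(n-1)}, ..., c_{-1}].
    col = torch.cat([c[n - 1:], c[:n - 1]])
    x_pad = torch.cat([x, x.new_zeros(n - 1)])
    # Circulant matvec = circular convolution = pointwise product under the FFT.
    y = torch.fft.ifft(torch.fft.fft(col) * torch.fft.fft(x_pad)).real
    return y[:n]

# Sanity check against an explicit dense Toeplitz matrix.
n = 5
c, x = torch.randn(2 * n - 1), torch.randn(n)
T = torch.stack([torch.stack([c[i - j + n - 1] for j in range(n)]) for i in range(n)])
assert torch.allclose(T @ x, toeplitz_matvec(c, x), atol=1e-5)
```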
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches.
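In practice this mirrors the [CLS]-token pattern from transformer classifiers: prepend one learnable vector to the patch-token sequence and let attention route global information through it. A sketch under that reading:

```python
import torch
import torch.nn as nn

class ContextToken(nn.Module):
    """Prepend one learnable context token to a patch-token sequence, in the
    spirit of a classification [CLS] token."""
    def __init__(self, dim):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patch_tokens):          # (B, N, dim)
        B = patch_tokens.shape[0]
        ctx = self.token.expand(B, -1, -1)    # one shared token, broadcast per sample
        return torch.cat([ctx, patch_tokens], dim=1)  # (B, N + 1, dim)
```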
arXiv Detail & Related papers (2021-05-23T12:44:27Z)