Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
- URL: http://arxiv.org/abs/2510.21986v1
- Date: Fri, 24 Oct 2025 19:29:55 GMT
- Title: Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers
- Authors: Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
- Abstract summary: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance, but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. We present Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality.
- Score: 91.02299679350834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance, but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse-Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency, followed by short full-token fine-tuning to close the train-inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
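The layered split described in the abstract lends itself to a short sketch. Below is a minimal, hypothetical PyTorch rendering, not the authors' implementation: generic Transformer encoder layers stand in for DiT blocks, the shallow blocks run densely over all tokens, the deep blocks run over a random 25% subset, and the two paths are fused by a residual scatter. The random drop rule, the module names, and the fusion detail are assumptions.

```python
# Hypothetical sketch of sparse-dense residual fusion (not the SPRINT code).
import torch
import torch.nn as nn

class SparseDenseDiTSketch(nn.Module):
    def __init__(self, dim=256, shallow=2, deep=10, heads=4, keep_ratio=0.25):
        super().__init__()
        self.shallow = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True, norm_first=True)
            for _ in range(shallow))
        self.deep = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True, norm_first=True)
            for _ in range(deep))
        self.keep_ratio = keep_ratio

    def forward(self, x):                          # x: (B, N, dim) image tokens
        for blk in self.shallow:                   # dense path: every token
            x = blk(x)
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))       # keep ~25% of tokens
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :k]
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
        sparse = torch.gather(x, 1, gather_idx)    # subset fed to deep blocks
        for blk in self.deep:
            sparse = blk(sparse)
        # Residual fusion: dropped tokens keep their shallow features,
        # kept tokens get shallow + deep features.
        return x.scatter(1, gather_idx, torch.gather(x, 1, gather_idx) + sparse)

tokens = torch.randn(2, 64, 256)
print(SparseDenseDiTSketch()(tokens).shape)        # torch.Size([2, 64, 256])
```

Since self-attention cost grows quadratically with sequence length, running the deep blocks on a quarter of the tokens cuts their attention FLOPs by roughly a factor of 16, consistent with the large training savings the abstract reports.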
Related papers
- Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning [71.30276778807068]
We propose a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data.
arXiv Detail & Related papers (2025-09-28T13:27:38Z) - DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers [28.595962720945348]
Transformers achieve state-of-the-art results across many tasks, but their uniform application of quadratic self-attention makes them computationally expensive. We introduce the Dynamic Token Routing Network, an improved Transformer architecture that allows tokens to dynamically skip the quadratic cost of cross-token mixing.
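The routing idea in this summary can be pictured as a block in which a learned router sends only the highest-scoring tokens through self-attention while every token still takes the cheap MLP path. The sketch below is a hypothetical illustration; the router, the top-k rule, and the residual fusion are assumptions, not the DTRNet design.

```python
# Hypothetical per-token routing block: only routed tokens pay for attention.
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, dim=256, heads=4, keep_ratio=0.5):
        super().__init__()
        self.router = nn.Linear(dim, 1)               # per-token routing score
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, x):                             # x: (B, N, dim)
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.router(x).squeeze(-1)           # (B, N)
        idx = scores.topk(k, dim=1).indices           # tokens routed to attention
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
        sel = torch.gather(x, 1, gather_idx)
        attn_out, _ = self.attn(sel, sel, sel)
        x = x.scatter(1, gather_idx, sel + attn_out)  # non-routed tokens bypass attention
        return x + self.mlp(x)                        # MLP path for every token

print(RoutedBlock()(torch.randn(2, 64, 256)).shape)   # torch.Size([2, 64, 256])
```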
arXiv Detail & Related papers (2025-08-31T16:21:21Z) - REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training [58.33728862521732]
Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy, representation alignment (REPA), which matches DiT hidden features to those of a non-generative teacher (e.g. DINO), dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. To address this, we introduce HASTE.
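The alignment mechanism described here can be sketched as an auxiliary loss that pulls a projected student feature toward a frozen teacher feature and is switched off after an early-stop step. The projection head, the cosine form of the loss, and the stop step are assumptions for illustration, not the HASTE recipe.

```python
# Assumed form of a representation-alignment term with early stopping.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    def __init__(self, student_dim=256, teacher_dim=768):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # map student to teacher space

    def forward(self, student_feat, teacher_feat, step, stop_step=100_000):
        if step >= stop_step:                  # early-stopped: no alignment pressure
            return student_feat.new_zeros(())
        s = F.normalize(self.proj(student_feat), dim=-1)
        t = F.normalize(teacher_feat.detach(), dim=-1)   # frozen teacher (e.g. DINO)
        return 1.0 - (s * t).sum(-1).mean()    # cosine-distance alignment term

head = AlignmentHead()
loss = head(torch.randn(2, 64, 256), torch.randn(2, 64, 768), step=10)
print(float(loss))
```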
arXiv Detail & Related papers (2025-05-22T15:34:33Z) - BEExformer: A Fast Inferencing Binarized Transformer with Early Exits [2.7651063843287718]
We introduce the Binarized Early Exit Transformer (BEExformer), the first selective-learning-based transformer integrating Binarization-Aware Training (BAT) with Early Exit (EE). BAT employs a differentiable second-order approximation to the sign function, yielding gradients that capture both the sign and magnitude of the weights. The EE mechanism hinges on the fractional reduction in entropy among intermediate transformer blocks, with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the "overthinking" problem inherent in deep networks.
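The early-exit side of this description can be illustrated with a small inference loop that stops once the entropy at an intermediate exit head stops dropping. The threshold rule and the pooled classification heads are assumptions, and the binarization-aware training part is omitted.

```python
# Illustrative entropy-driven early exit at inference (not the BEExformer code).
import torch
import torch.nn as nn

def entropy(logits):
    p = logits.softmax(-1)
    return -(p * p.clamp_min(1e-9).log()).sum(-1).mean()

@torch.no_grad()
def early_exit_forward(blocks, exit_heads, x, min_drop=0.01):
    prev_h, logits = None, None
    for blk, head in zip(blocks, exit_heads):
        x = blk(x)
        logits = head(x.mean(dim=1))            # pooled classification logits
        h = entropy(logits)
        if prev_h is not None and (prev_h - h) < min_drop:
            break                               # entropy stopped dropping: exit early
        prev_h = h
    return logits

dim, classes = 64, 10
blocks = nn.ModuleList(nn.TransformerEncoderLayer(dim, 4, batch_first=True)
                       for _ in range(6))
heads = nn.ModuleList(nn.Linear(dim, classes) for _ in range(6))
print(early_exit_forward(blocks, heads, torch.randn(2, 16, dim)).shape)
```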
arXiv Detail & Related papers (2024-12-06T17:58:14Z) - Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
Sparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods at both low and high compression rates.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, memory consumption during training is up to one order of magnitude lower than with conventional training.
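Layer tying through weight sharing can be shown with a toy builder in which a tying assignment maps each depth to the layer whose parameters it reuses. The RL policy that picks the assignment in the paper is not modeled here; the fixed assignment below is a placeholder.

```python
# Toy layer tying: tied depths reuse the same module object, i.e. shared weights.
import torch
import torch.nn as nn

def build_tied_stack(dim, heads, tying):          # tying[i] = index of canonical layer
    canonical, stack = {}, []
    for src in tying:
        if src not in canonical:
            canonical[src] = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        stack.append(canonical[src])              # same object => shared parameters
    return nn.ModuleList(stack)

stack = build_tied_stack(64, 4, tying=[0, 0, 2, 2, 0, 5])   # 6 depths, 3 unique layers
unique_params = {id(p) for m in stack for p in m.parameters()}
print(len(stack), "depths,", len(unique_params), "unique parameter tensors")
```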
arXiv Detail & Related papers (2024-01-23T14:53:20Z) - Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training [48.152207339344564]
We show that layer freezing and data sieving can be incorporated into the sparse training algorithm to form a generic framework, which we dub SpFDE.
Our experiments demonstrate that SpFDE can significantly reduce training costs while preserving accuracy from three dimensions: weight sparsity, layer freezing, and dataset sieving.
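A rough sketch of two of these ingredients, freezing shallow layers and sieving the dataset, is given below. The freezing point and the keep-the-hardest-examples sieving rule are illustrative assumptions rather than the SpFDE procedure.

```python
# Illustrative layer freezing and data sieving (assumptions, not the SpFDE code).
import torch
import torch.nn as nn
from torch.utils.data import Subset, TensorDataset

def freeze_early_layers(layers, n_frozen):
    for layer in layers[:n_frozen]:               # stop updating the shallow layers
        for p in layer.parameters():
            p.requires_grad_(False)

def sieve_dataset(dataset, per_example_loss, keep_ratio=0.5):
    k = max(1, int(len(dataset) * keep_ratio))
    keep = torch.topk(per_example_loss, k).indices   # keep the hardest examples
    return Subset(dataset, keep.tolist())

layers = nn.ModuleList(nn.Linear(16, 16) for _ in range(8))
freeze_early_layers(layers, n_frozen=4)
data = TensorDataset(torch.randn(100, 16), torch.randint(0, 10, (100,)))
small = sieve_dataset(data, per_example_loss=torch.rand(100))
print(sum(p.requires_grad for p in layers.parameters()), "trainable tensors;",
      len(small), "examples kept")
```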
arXiv Detail & Related papers (2022-09-22T17:45:23Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with state-of-the-art re-parameterization models, OREPA reduces training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
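The underlying re-parameterization trick, collapsing parallel branches into a single convolution so that only one operator is paid for, can be demonstrated in a few lines. The sketch below is the generic RepVGG-style merge of two same-shaped conv branches, not the OREPA implementation.

```python
# Generic branch-merging re-parameterization (assumes matching kernel/stride/padding).
import torch
import torch.nn as nn

def merge_parallel_convs(conv_a: nn.Conv2d, conv_b: nn.Conv2d) -> nn.Conv2d:
    merged = nn.Conv2d(conv_a.in_channels, conv_a.out_channels,
                       conv_a.kernel_size, padding=conv_a.padding, bias=True)
    with torch.no_grad():
        merged.weight.copy_(conv_a.weight + conv_b.weight)  # sum the kernels
        merged.bias.copy_(conv_a.bias + conv_b.bias)        # sum the biases
    return merged

a = nn.Conv2d(8, 8, 3, padding=1)
b = nn.Conv2d(8, 8, 3, padding=1)
x = torch.randn(1, 8, 32, 32)
fused = merge_parallel_convs(a, b)
print(torch.allclose(a(x) + b(x), fused(x), atol=1e-5))  # True: same function
```

Because convolution is linear in its weights, the merged kernel reproduces the sum of the two branches exactly up to floating-point error, which the final check confirms.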
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation.
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
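The forward-only/backward-only split can be sketched by running the lower stack under no_grad, so that gradients and optimizer updates touch only the top layers. The particular 4-bottom/2-top split and the placeholder objective are illustrative assumptions.

```python
# Illustrative split: bottom layers forward-only, top layers trained.
import torch
import torch.nn as nn

dim = 64
bottom = nn.Sequential(*[nn.TransformerEncoderLayer(dim, 4, batch_first=True)
                         for _ in range(4)])
top = nn.Sequential(*[nn.TransformerEncoderLayer(dim, 4, batch_first=True)
                      for _ in range(2)])
opt = torch.optim.AdamW(top.parameters(), lr=1e-4)    # only top layers get updated

x = torch.randn(8, 16, dim)
with torch.no_grad():                                  # bottom: forward computation only
    h = bottom(x)
loss = top(h).pow(2).mean()                            # placeholder training objective
loss.backward()                                        # backprop stops at h
opt.step()
print(all(p.grad is None for p in bottom.parameters()))  # True: no bottom gradients
```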
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.