DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
- URL: http://arxiv.org/abs/2602.16968v1
- Date: Thu, 19 Feb 2026 00:15:20 GMT
- Title: DDiT: Dynamic Patch Scheduling for Efficient Diffusion Transformers
- Authors: Dahye Kim, Deepti Ghadiyaram, Raghudeep Gadde
- Abstract summary: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality.
- Score: 6.406853903837331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in image and video generation, but their success comes at the cost of heavy computation. This inefficiency is largely due to the fixed tokenization process, which uses constant-sized patches throughout the entire denoising phase, regardless of the content's complexity. We propose dynamic tokenization, an efficient test-time strategy that varies patch sizes based on content complexity and the denoising timestep. Our key insight is that early timesteps only require coarser patches to model global structure, while later iterations demand finer (smaller-sized) patches to refine local details. During inference, our method dynamically reallocates patch sizes across denoising steps for image and video generation and substantially reduces cost while preserving perceptual generation quality. Extensive experiments demonstrate the effectiveness of our approach: it achieves up to $3.52\times$ and $3.2\times$ speedup on FLUX.1-Dev and Wan 2.1, respectively, without compromising generation quality or prompt adherence.
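The abstract gives the schedule's shape (coarse patches early, fine patches late) but no concrete parameters. Purely as an illustration of that idea, the Python sketch below implements one possible timestep-dependent patch-size schedule; the patch sizes (4/2/1), the switch points, and all function names are assumptions, not the authors' implementation.

```python
# Minimal sketch of a coarse-to-fine patch schedule as described in the
# abstract. The patch sizes and switch points below are illustrative
# assumptions, not DDiT's actual configuration.

def patch_size_for_step(step: int, total_steps: int) -> int:
    """Pick a patch edge length for the current denoising step:
    coarse patches early (global structure), fine patches late
    (local detail)."""
    progress = step / total_steps
    if progress < 0.4:      # early steps: global structure only
        return 4
    if progress < 0.8:      # middle steps: intermediate detail
        return 2
    return 1                # late steps: fine local refinement

def tokens_per_latent(height: int, width: int, patch: int) -> int:
    """Token count for one latent map. Attention cost scales with the
    square of this count, so coarse patches cut cost sharply."""
    return (height // patch) * (width // patch)

if __name__ == "__main__":
    H = W = 64  # illustrative latent resolution
    for step in range(0, 50, 10):
        p = patch_size_for_step(step, total_steps=50)
        print(f"step {step:2d}: patch {p} -> {tokens_per_latent(H, W, p)} tokens")
```

With these assumed numbers, the earliest steps process $16\times$ fewer tokens than the final steps, which is the kind of reallocation behind the speedups the abstract reports.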
Related papers
- Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers [11.772150619675527]
Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. We propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC).
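Of the three components, only the Cumulative Error Budget suggests a concrete mechanism from its name alone. One speculative reading, sketched below under assumed names and values rather than the paper's actual rule, is to accumulate an error estimate while serving cached features and force a recomputation once the running total exceeds a budget.

```python
# Hypothetical cumulative-error-budget cache controller. The error proxy
# and the budget value are assumptions; the summary does not specify
# SpectralCache's actual criterion.
class ErrorBudgetCache:
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def should_recompute(self, est_error: float) -> bool:
        """Reuse the cache while the accumulated error estimate stays
        under budget; recompute (and reset the budget) once exceeded."""
        if self.spent + est_error > self.budget:
            self.spent = 0.0
            return True
        self.spent += est_error
        return False

# Usage: recompute whenever the controller says the budget is exhausted.
cache = ErrorBudgetCache(budget=1.0)
for step, err in enumerate([0.2, 0.3, 0.4, 0.3, 0.1]):
    print(step, "recompute" if cache.should_recompute(err) else "reuse")
```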
arXiv Detail & Related papers (2026-03-05T15:58:06Z)
- Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache [8.614492355393578]
We propose DPCache, a training-free acceleration framework that formulates diffusion acceleration as a global path planning problem. DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. Experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss.
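The summary frames timestep selection as shortest-path planning solved with dynamic programming. The toy sketch below illustrates that formulation under an assumed quadratic skip-cost model (the summary does not give DPCache's actual cost function): choose K key timesteps out of T so the summed cost of the skips between them is minimal.

```python
import math

def plan_key_timesteps(T: int, K: int) -> list[int]:
    """Pick K key timesteps from range(T) minimizing summed skip costs."""
    def skip_cost(i: int, j: int) -> float:
        # Assumed proxy: skipping many steps at once hurts fidelity more.
        return (j - i) ** 2

    INF = math.inf
    # dp[j][k] = best cost of reaching step j with k key steps chosen.
    dp = [[INF] * (K + 1) for _ in range(T)]
    prev = [[-1] * (K + 1) for _ in range(T)]
    dp[0][1] = 0.0  # the first timestep is always a key step
    for j in range(1, T):
        for k in range(2, K + 1):
            for i in range(j):
                cost = dp[i][k - 1] + skip_cost(i, j)
                if cost < dp[j][k]:
                    dp[j][k], prev[j][k] = cost, i
    # Backtrack from the final timestep, which must also be a key step.
    path, j, k = [], T - 1, K
    while j >= 0:
        path.append(j)
        j, k = prev[j][k], k - 1
    return path[::-1]

print(plan_key_timesteps(T=50, K=10))  # roughly evenly spaced key steps
```

Under the quadratic proxy the optimum is near-even spacing; a cost tied to actual trajectory fidelity would bend the plan toward whichever timesteps matter most.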
arXiv Detail & Related papers (2026-02-26T06:13:33Z)
- LiteAttention: A Temporal Sparse Attention for Diffusion Transformers [1.3471268811218626]
LiteAttention exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models.
arXiv Detail & Related papers (2025-11-14T08:26:55Z)
- PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching [51.98089287914147]
Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo.
arXiv Detail & Related papers (2025-10-23T03:52:39Z)
- Sliding Window Attention for Learned Video Compression [67.57073402826292]
This work introduces 3D Sliding Window Attention (SWA), a patchless form of local attention. Our method significantly improves rate-distortion performance, achieving Bjøntegaard Delta-rate savings of up to 18.6%.
arXiv Detail & Related papers (2025-10-04T20:11:43Z)
- H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z)
- Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation [3.321460333625124]
Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing, but at a high computational cost. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance.
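The summary says Foresight adapts layer reuse to generation dynamics but not how the reuse decision is made. A common pattern in this family of methods, shown below purely as an assumed sketch, is to cache each block's output and reuse it while the block's input stays nearly unchanged between steps; the relative-norm test and 5% threshold are illustrative, not the paper's criterion.

```python
# Assumed sketch of adaptive layer reuse: recompute a transformer block
# only when its input has changed noticeably since the last computed
# step. The relative-norm test and threshold are illustrative, not
# Foresight's actual decision rule.
import torch

class ReusableBlock(torch.nn.Module):
    def __init__(self, block: torch.nn.Module, tol: float = 0.05):
        super().__init__()
        self.block, self.tol = block, tol
        self._last_in, self._last_out = None, None

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._last_in is not None:
            rel = (x - self._last_in).norm() / (self._last_in.norm() + 1e-8)
            if rel < self.tol:          # input barely moved since the
                return self._last_out   # last step: reuse cached output
        out = self.block(x)
        self._last_in, self._last_out = x.clone(), out
        return out
```

Wrapping each DiT block this way would let later denoising steps skip layers whose activations have converged, which matches the redundancy the summary points at.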
arXiv Detail & Related papers (2025-05-31T00:52:17Z)
- One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full- and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
- Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers [55.87192133758051]
Diffusion Transformers (DiTs) have achieved state-of-the-art (SOTA) image generation quality but suffer from high latency and memory inefficiency. We propose DiffCR, a dynamic DiT inference framework with differentiable compression ratios.
arXiv Detail & Related papers (2024-12-22T02:04:17Z)
- HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration [31.982294870690925]
We develop a novel learning-based caching framework dubbed HarmoniCa. It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process. Our framework achieves over $40\%$ latency reduction (i.e., $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$.
arXiv Detail & Related papers (2024-10-02T16:34:29Z)
- Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces training time by up to 40% without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z)