RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer
- URL: http://arxiv.org/abs/2509.22323v1
- Date: Fri, 26 Sep 2025 13:20:52 GMT
- Title: RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer
- Authors: Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou, Kai Wang, Bohan Zhuang, Zhangyang Wang, Fan Wang, Yang You
- Abstract summary: Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. We introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers. It delivers image-wise acceleration with zero updates to the base generator. It achieves nearly 3x faster sampling with competitive generation quality.
- Score: 86.57077884971478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers, a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads - Step-Skip, Cache-Reuse, and Sparse-Attention - observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model's distribution. Across state-of-the-art DiT backbones, including Stable Diffusion 3 and FLUX, RAPID3 achieves nearly 3x faster sampling with competitive generation quality.
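The control loop described in the abstract can be sketched as a toy rollout. Everything below (the `PolicyHead` class, the 4-dimensional state features, the speed/quality reward shaping) is an illustrative assumption, not the paper's implementation: the adversarially learned discriminator reward is replaced by a crude step-count penalty, and only GRPO's group-relative advantage computation is shown, not the policy-gradient update.

```python
import numpy as np

rng = np.random.default_rng(0)

class PolicyHead:
    """A tiny linear policy over a feature summary of the denoising state
    (stand-in for the paper's lightweight policy heads)."""
    def __init__(self, n_actions, dim=4):
        self.w = rng.normal(scale=0.1, size=(dim, n_actions))

    def probs(self, state):
        logits = state @ self.w
        e = np.exp(logits - logits.max())      # stable softmax
        return e / e.sum()

    def act(self, state):
        p = self.probs(state)
        return rng.choice(len(p), p=p), p

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward against
    the group mean/std, as in GRPO (no value network needed)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Three independent heads: skip a step (2 actions), reuse cached
# features (2 actions), pick a sparse-attention level (3 actions).
heads = {"step_skip": PolicyHead(2),
         "cache_reuse": PolicyHead(2),
         "sparse_attn": PolicyHead(3)}

# Roll out a group of G samples with the generator "frozen" (not modeled).
# Reward = speedup minus a quality penalty; the paper instead uses a
# learned discriminator to keep samples near the base model's distribution.
G, T = 4, 10
rewards = []
for _ in range(G):
    steps_run = 0
    for t in range(T):
        state = rng.normal(size=4)            # stand-in denoising-state features
        skip, _ = heads["step_skip"].act(state)
        if skip:                              # head decided to skip this timestep
            continue
        heads["cache_reuse"].act(state)       # cache-reuse decision (effect not modeled)
        heads["sparse_attn"].act(state)       # sparsity-level decision (effect not modeled)
        steps_run += 1
    speedup = T / max(steps_run, 1)
    quality_penalty = 0.1 * (T - steps_run)   # crude stand-in for discriminator score
    rewards.append(speedup - quality_penalty)

adv = grpo_advantages(rewards)
print(adv)
```

In an actual GRPO update, each head's log-probabilities for the actions it took would be scaled by these group-relative advantages, so rollouts that beat their group's average (faster sampling without drifting from the frozen generator's distribution) are reinforced.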
Related papers
- Elastic Diffusion Transformer [32.62353162897611]
Diffusion Transformers (DiT) have demonstrated remarkable generative capabilities but remain highly computationally expensive. We propose Elastic Diffusion Transformer (E-DiT), an adaptive acceleration framework for DiT.
arXiv Detail & Related papers (2026-02-15T05:19:17Z)
- Fast-SAM3D: 3Dfy Anything in Images but Faster [65.17322167628367]
SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. We present Fast-SAM3D, a training-free framework that aligns computation with instantaneous generation complexity.
arXiv Detail & Related papers (2026-02-05T04:27:59Z)
- Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis [50.48301331112126]
We propose NOVA, a training-free token-reduction acceleration framework for Visual AutoRegressive modeling. NOVA adaptively determines the acceleration activation scale during inference by identifying online the inflection point of scale-entropy growth. Experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.
arXiv Detail & Related papers (2026-02-01T17:29:42Z)
- Fast-ARDiff: An Entropy-informed Acceleration Framework for Continuous Space Autoregressive Generation [12.384836052394272]
Autoregressive (AR)-diffusion hybrid paradigms combine AR's structured modeling with diffusion's synthesis. We propose Fast-ARDiff, a unified AR-diffusion framework that jointly optimizes both components. Fast-ARDiff achieves state-of-the-art acceleration across diverse models.
arXiv Detail & Related papers (2025-12-09T12:35:18Z)
- USV: Unified Sparsification for Accelerating Video Diffusion Models [11.011602744993942]
Unified Sparsification for Video diffusion models (USV) is an end-to-end trainable framework. It orchestrates sparsification across both the model's internal computation and its sampling process. It achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity.
arXiv Detail & Related papers (2025-12-05T14:40:06Z)
- MeanFlow Transformers with Representation Autoencoders [71.45823902973349]
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. We develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE). We achieve a 1-step FID of 2.03, outperforming vanilla MF's 3.43, while reducing sampling GFLOPS by 38% and total training cost by 83% on ImageNet 256.
arXiv Detail & Related papers (2025-11-17T06:17:08Z)
- TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs [67.55973229034319]
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We show that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-09-22T17:30:15Z)
- SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching [17.724549528455317]
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. We present SpeCa, a novel 'Forecast-then-verify' acceleration framework that effectively addresses both limitations. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction.
arXiv Detail & Related papers (2025-09-15T06:46:22Z)
- Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers [9.875073051988057]
Region-Adaptive Latent Upsampling (RALU) is a training-free framework that accelerates inference along the spatial dimension. RALU performs mixed-resolution sampling across three stages: 1) low-resolution latent diffusion denoising to efficiently capture global semantic structure, 2) region-adaptive upsampling of specific regions prone to artifacts at full resolution, and 3) all-latent upsampling at full resolution for detail refinement. Our method significantly reduces computation while preserving image quality, achieving up to 7.0x speed-up on FLUX and 3.0x on Stable Diffusion 3 with minimal degradation.
arXiv Detail & Related papers (2025-07-11T09:07:43Z)
- SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping [30.85025293160079]
High-frequency components, or later steps, in the generation process contribute disproportionately to inference latency. We identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy. We propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency.
arXiv Detail & Related papers (2025-06-10T15:35:29Z)
- ALTER: All-in-One Layer Pruning and Temporal Expert Routing for Efficient Diffusion Generation [40.68265817413368]
We introduce ALTER: All-in-One Layer Pruning and Temporal Expert Routing, a unified framework that transforms diffusion models into a mixture of efficient temporal experts. A single-stage optimization unifies layer pruning, expert routing, and model fine-tuning by employing a trainable hypernetwork.
arXiv Detail & Related papers (2025-05-27T22:59:44Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
The new tokenizer, a conditioned diffusion-based tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3x inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [94.10861578387443]
We explore the inference process of two mainstream T2V models using transformers and diffusion models.
We propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.
Extensive experiments on three datasets, using the classic transformer-based model CogVideo and a typical diffusion-based model, Tune-A-Video, verify the effectiveness of F3-Pruning.
arXiv Detail & Related papers (2023-12-06T12:34:47Z)
- Prompt2NeRF-PIL: Fast NeRF Generation via Pretrained Implicit Latent [61.56387277538849]
This paper explores promptable NeRF generation for direct conditioning and fast generation of NeRF parameters for the underlying 3D scenes.
Prompt2NeRF-PIL is capable of generating a variety of 3D objects with a single forward pass.
We show that our approach speeds up the text-to-NeRF model DreamFusion and the 3D reconstruction of the image-to-NeRF method Zero-1-to-3 by 3 to 5 times.
arXiv Detail & Related papers (2023-12-05T08:32:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.