Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation
- URL: http://arxiv.org/abs/2602.19161v1
- Date: Sun, 22 Feb 2026 12:43:50 GMT
- Title: Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation
- Authors: Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang
- Abstract summary: Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. We propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. We show that Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
- Score: 16.210613736589597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method that effectively mitigates severe channel redundancy, and (2) a stage-wise dominant-operator optimization strategy that addresses the high inference cost of the causal 3D convolutions widely used in VAE decoders. Based on these innovations, we construct the Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on the Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while retaining up to 96.9% of the original reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
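The abstract names two concrete ingredients, causal 3D convolutions as the decoder's dominant operator and channel pruning against redundancy. Below is a minimal PyTorch sketch of both, with a plain L1-magnitude criterion standing in for the paper's independence-aware one (the actual criterion and decoder internals are not given in the abstract):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that pads only into the past along time, so the
    output at frame t never depends on frames after t."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.time_pad = kernel - 1  # pad past frames only
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):  # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))  # left-pad the time axis
        return self.conv(x)

def prune_output_channels(conv: nn.Conv3d, keep_ratio: float) -> nn.Conv3d:
    """Keep the output channels with the largest filter L1 norm: a simple
    stand-in for the paper's independence-aware criterion, which also
    accounts for redundancy *between* channels."""
    w = conv.weight.data  # (out_ch, in_ch, kT, kH, kW)
    scores = w.abs().flatten(1).sum(dim=1)  # per-output-channel L1 norm
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = scores.topk(n_keep).indices.sort().values
    pruned = nn.Conv3d(conv.in_channels, n_keep, conv.kernel_size,
                       padding=conv.padding)
    pruned.weight.data = w[keep].clone()
    pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

layer = CausalConv3d(in_ch=256, out_ch=256)
layer.conv = prune_output_channels(layer.conv, keep_ratio=0.5)
y = layer(torch.randn(1, 256, 9, 32, 32))  # -> (1, 128, 9, 32, 32)
```

Pruning at keep_ratio=0.5 halves the layer's output channels (the next layer's input weights must be sliced to match), cutting its FLOPs roughly in half; the paper's three-phase distillation would then recover reconstruction quality.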
Related papers
- FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution [61.284842030283464]
FlashVSR is the first diffusion-based one-step streaming framework towards real-time VSR. It runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU. It scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to a 12x speedup over prior one-step diffusion VSR models.
arXiv Detail & Related papers (2025-10-14T17:25:54Z)
- SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization [56.12853087022071]
We introduce a new pixel diffusion decoder architecture for improved scaling and training stability. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction and trained without adversarial losses.
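Beyond naming the recipe (distill a multi-step diffusion decoder into a single-step one, without adversarial losses), the summary gives no training details. A minimal sketch of one distillation update under those constraints, where `teacher_decode` and the student module are hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher_decode, latents, optimizer):
    """One distillation update: the student learns to reproduce in a single
    forward pass what the teacher produces over many denoising steps,
    using a plain regression loss (no adversarial term, per the summary)."""
    with torch.no_grad():
        target = teacher_decode(latents)  # slow multi-step diffusion decode
    pred = student(latents)               # fast single-step reconstruction
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```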
arXiv Detail & Related papers (2025-10-06T15:57:31Z)
- A Lightweight Dual-Mode Optimization for Generative Face Video Coding [26.308480665852052]
Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. We propose a lightweight GFVC framework that introduces dual-mode optimization to reduce complexity whilst preserving reconstruction quality. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC achieves a 90.4% parameter reduction and an 88.9% saving compared to the baseline.
arXiv Detail & Related papers (2025-08-19T06:09:28Z)
- FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion [44.206702976963676]
We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation. Our approach features three key innovations: 1) a unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) a denoising-step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) a native, hardware-friendly kernel that leverages FlashAttention and is implemented with Hopper-optimized architecture features for highly efficient execution.
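The first innovation, tile-wise granularity, can be illustrated in isolation. A simplified sketch of per-tile FP8 quantization in PyTorch (requires a build with `torch.float8_e4m3fn`, i.e. PyTorch 2.1+); the actual method tiles jointly over the 3D video dimensions and co-designs sparsity, which this omits:

```python
import torch

FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_tiles(x: torch.Tensor, tile: int = 64):
    """Quantize with one scale per contiguous tile: a simplified take on
    the unified tile-wise granularity the summary describes."""
    orig_shape = x.shape
    x = x.reshape(-1, tile)                           # flatten into tiles
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (x / scale).to(torch.float8_e4m3fn)           # per-tile scaled cast
    return q.reshape(orig_shape), scale

def dequantize_tiles(q: torch.Tensor, scale: torch.Tensor, tile: int = 64):
    return (q.reshape(-1, tile).to(torch.float32) * scale).reshape(q.shape)

x = torch.randn(4, 128, 64)                           # e.g. (heads, tokens, dim)
q, s = quantize_tiles(x)
err = (x - dequantize_tiles(q, s)).abs().max()        # small round-trip error
```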
arXiv Detail & Related papers (2025-06-05T05:30:30Z)
- VORTA: Efficient Video Diffusion via Routing Sparse Attention [54.84294780326206]
VORTA is an acceleration framework with two novel components. It achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench. It can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation.
arXiv Detail & Related papers (2025-05-24T17:46:47Z)
- H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [97.45170082949552]
The autoencoder (AE) is the key to the success of latent diffusion models for image and video generation. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile.
arXiv Detail & Related papers (2025-04-14T17:59:06Z)
- Unleashing Vecset Diffusion Model for Fast Shape Generation [21.757511934035758]
FlashVDM is a framework for accelerating both the VAE and the DiT in the Vecset Diffusion Model (VDM). For the DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps at comparable quality. For the VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and an Efficient Network Design.
arXiv Detail & Related papers (2025-03-20T16:23:44Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. CDT is trained from scratch using only a basic MSE diffusion loss for reconstruction, together with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (a $3\times$ inference speedup) still performs comparably with top baselines.
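The summary fully names the loss mix, so a sketch is straightforward; the weights and the `lpips_fn` handle below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def cdt_training_loss(noise_pred, noise, mu, logvar, recon, target,
                      lpips_fn, kl_weight=1e-6, lpips_weight=1.0):
    """Sketch of the loss mix the CDT summary names: an MSE diffusion loss,
    a KL term on the latent posterior, and an LPIPS perceptual loss.
    `lpips_fn` stands in for a perceptual metric (e.g. the `lpips` package);
    the weights are illustrative, not the paper's."""
    diffusion_mse = F.mse_loss(noise_pred, noise)  # denoising objective
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    perceptual = lpips_fn(recon, target).mean()
    return diffusion_mse + kl_weight * kl + lpips_weight * perceptual
```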
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints [51.83081671798784]
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, DiT's practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference. We propose Skip-DiT, an image and video generative DiT variant enhanced with Long-Skip-Connections (LSCs), the key efficiency component in U-Nets.
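A structural sketch of what long skip connections look like when grafted onto a plain transformer stack; the block internals and the merge operator below are placeholders, not Skip-DiT's actual design:

```python
import torch
import torch.nn as nn

class SkipDiTBlocks(nn.Module):
    """Toy stack of transformer blocks with U-Net-style long skip
    connections: the first half's activations are cached and merged
    into the mirrored second half."""
    def __init__(self, dim=256, depth=8, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)])
        # one linear merge per long skip, fusing (current, skipped) features
        self.merges = nn.ModuleList([
            nn.Linear(dim * 2, dim) for _ in range(depth // 2)])

    def forward(self, x):                 # x: (batch, tokens, dim)
        skips = []
        half = len(self.blocks) // 2
        for blk in self.blocks[:half]:    # first half: cache activations
            x = blk(x)
            skips.append(x)
        for blk, merge in zip(self.blocks[half:], self.merges):
            x = merge(torch.cat([x, skips.pop()], dim=-1))  # long skip in
            x = blk(x)
        return x

y = SkipDiTBlocks()(torch.randn(2, 16, 256))  # -> (2, 16, 256)
```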
arXiv Detail & Related papers (2024-11-26T17:28:10Z)