Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation
- URL: http://arxiv.org/abs/2602.19161v1
- Date: Sun, 22 Feb 2026 12:43:50 GMT
- Title: Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation
- Authors: Lunjie Zhu, Yushi Huang, Xingtong Ge, Yufei Xue, Zhening Liu, Yumeng Zhang, Zehong Lin, Jun Zhang
- Abstract summary: Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. We propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. We show that Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
- Score: 16.210613736589597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method that effectively mitigates severe channel redundancy, and (2) a stage-wise dominant-operator optimization strategy that addresses the high inference cost of the causal 3D convolutions widely used in VAE decoders. Based on these innovations, we construct the Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on the Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while retaining up to 96.9% of the original reconstruction performance. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
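The abstract names two concrete ingredients, causal 3D convolutions as the decoder's dominant operator and channel pruning against redundancy. Below is a minimal PyTorch sketch of both, with a plain L1-magnitude criterion standing in for the paper's independence-aware one (the actual criterion and decoder internals are not given in the abstract):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that pads only into the past along time, so the
    output at frame t never depends on frames after t."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.time_pad = kernel - 1  # pad past frames only
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):  # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))  # left-pad the time axis
        return self.conv(x)

def prune_output_channels(conv: nn.Conv3d, keep_ratio: float) -> nn.Conv3d:
    """Keep the output channels with the largest filter L1 norm: a simple
    stand-in for the paper's independence-aware criterion, which also
    accounts for redundancy *between* channels."""
    w = conv.weight.data  # (out_ch, in_ch, kT, kH, kW)
    scores = w.abs().flatten(1).sum(dim=1)  # per-output-channel L1 norm
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep = scores.topk(n_keep).indices.sort().values
    pruned = nn.Conv3d(conv.in_channels, n_keep, conv.kernel_size,
                       padding=conv.padding)
    pruned.weight.data = w[keep].clone()
    pruned.bias.data = conv.bias.data[keep].clone()
    return pruned

layer = CausalConv3d(in_ch=256, out_ch=256)
layer.conv = prune_output_channels(layer.conv, keep_ratio=0.5)
y = layer(torch.randn(1, 256, 9, 32, 32))  # -> (1, 128, 9, 32, 32)
```

Pruning at keep_ratio=0.5 halves the layer's output channels (the next layer's input weights must be sliced to match), cutting its FLOPs roughly in half; the paper's three-phase distillation would then recover reconstruction quality.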
Related papers
- FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution [61.284842030283464]
FlashVSR is the first diffusion-based one-step streaming framework towards real-time VSR. It runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU. It scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to a 12x speedup over prior one-step diffusion VSR models.
arXiv Detail & Related papers (2025-10-14T17:25:54Z)
- SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization [56.12853087022071]
We introduce a new pixel diffusion decoder architecture for improved scaling and training stability. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction and trained without adversarial losses.
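Beyond naming the recipe (distill a multi-step diffusion decoder into a single-step one, without adversarial losses), the summary gives no training details. A minimal sketch of one distillation update under those constraints, where `teacher_decode` and the student module are hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher_decode, latents, optimizer):
    """One distillation update: the student learns to reproduce in a single
    forward pass what the teacher produces over many denoising steps,
    using a plain regression loss (no adversarial term, per the summary)."""
    with torch.no_grad():
        target = teacher_decode(latents)  # slow multi-step diffusion decode
    pred = student(latents)               # fast single-step reconstruction
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```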
arXiv Detail & Related papers (2025-10-06T15:57:31Z)
- A Lightweight Dual-Mode Optimization for Generative Face Video Coding [26.308480665852052]
Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. We propose a lightweight GFVC framework that introduces dual-mode optimization to reduce complexity whilst preserving reconstruction quality. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC achieves a 90.4% parameter reduction and an 88.9% saving compared to the baseline.
arXiv Detail & Related papers (2025-08-19T06:09:28Z)
- FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion [44.206702976963676]
We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation. Our approach features three key innovations: 1) a unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) a denoising-step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) a native, hardware-friendly kernel that leverages FlashAttention and is implemented with Hopper-optimized architecture features for highly efficient execution.
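The first innovation, tile-wise granularity, can be illustrated in isolation. A simplified sketch of per-tile FP8 quantization in PyTorch (requires a build with `torch.float8_e4m3fn`, i.e. PyTorch 2.1+); the actual method tiles jointly over the 3D video dimensions and co-designs sparsity, which this omits:

```python
import torch

FP8_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_tiles(x: torch.Tensor, tile: int = 64):
    """Quantize with one scale per contiguous tile: a simplified take on
    the unified tile-wise granularity the summary describes."""
    orig_shape = x.shape
    x = x.reshape(-1, tile)                           # flatten into tiles
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (x / scale).to(torch.float8_e4m3fn)           # per-tile scaled cast
    return q.reshape(orig_shape), scale

def dequantize_tiles(q: torch.Tensor, scale: torch.Tensor, tile: int = 64):
    return (q.reshape(-1, tile).to(torch.float32) * scale).reshape(q.shape)

x = torch.randn(4, 128, 64)                           # e.g. (heads, tokens, dim)
q, s = quantize_tiles(x)
err = (x - dequantize_tiles(q, s)).abs().max()        # small round-trip error
```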
arXiv Detail & Related papers (2025-06-05T05:30:30Z)
- VORTA: Efficient Video Diffusion via Routing Sparse Attention [54.84294780326206]
VORTA is an acceleration framework with two novel components. It achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench. It can seamlessly integrate with various other acceleration methods, such as model caching and step distillation, reaching up to a $14.41\times$ speedup with negligible performance degradation.
arXiv Detail & Related papers (2025-05-24T17:46:47Z)
- H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [97.45170082949552]
The autoencoder (AE) is the key to the success of latent diffusion models for image and video generation. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile.
arXiv Detail & Related papers (2025-04-14T17:59:06Z)
- Unleashing Vecset Diffusion Model for Fast Shape Generation [21.757511934035758]
FlashVDM is a framework for accelerating both the VAE and the DiT in the Vecset Diffusion Model (VDM). For the DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps at comparable quality. For the VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and an Efficient Network Design.
arXiv Detail & Related papers (2025-03-20T16:23:44Z)
- Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model. CDT is trained from scratch using only a basic MSE diffusion loss for reconstruction, together with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (a $3\times$ inference speedup) still performs comparably with top baselines.
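The summary fully names the loss mix, so a sketch is straightforward; the weights and the `lpips_fn` handle below are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def cdt_training_loss(noise_pred, noise, mu, logvar, recon, target,
                      lpips_fn, kl_weight=1e-6, lpips_weight=1.0):
    """Sketch of the loss mix the CDT summary names: an MSE diffusion loss,
    a KL term on the latent posterior, and an LPIPS perceptual loss.
    `lpips_fn` stands in for a perceptual metric (e.g. the `lpips` package);
    the weights are illustrative, not the paper's."""
    diffusion_mse = F.mse_loss(noise_pred, noise)  # denoising objective
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    perceptual = lpips_fn(recon, target).mean()
    return diffusion_mse + kl_weight * kl + lpips_weight * perceptual
```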
arXiv Detail & Related papers (2025-03-05T17:59:19Z)
- Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints [51.83081671798784]
Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, DiT's practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference. We propose Skip-DiT, an image and video generative DiT variant enhanced with Long-Skip-Connections (LSCs), the key efficiency component in U-Nets.
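A structural sketch of what long skip connections look like when grafted onto a plain transformer stack; the block internals and the merge operator below are placeholders, not Skip-DiT's actual design:

```python
import torch
import torch.nn as nn

class SkipDiTBlocks(nn.Module):
    """Toy stack of transformer blocks with U-Net-style long skip
    connections: the first half's activations are cached and merged
    into the mirrored second half."""
    def __init__(self, dim=256, depth=8, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)])
        # one linear merge per long skip, fusing (current, skipped) features
        self.merges = nn.ModuleList([
            nn.Linear(dim * 2, dim) for _ in range(depth // 2)])

    def forward(self, x):                 # x: (batch, tokens, dim)
        skips = []
        half = len(self.blocks) // 2
        for blk in self.blocks[:half]:    # first half: cache activations
            x = blk(x)
            skips.append(x)
        for blk, merge in zip(self.blocks[half:], self.merges):
            x = merge(torch.cat([x, skips.pop()], dim=-1))  # long skip in
            x = blk(x)
        return x

y = SkipDiTBlocks()(torch.randn(2, 16, 256))  # -> (2, 16, 256)
```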
arXiv Detail & Related papers (2024-11-26T17:28:10Z)