FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
- URL: http://arxiv.org/abs/2506.04648v2
- Date: Fri, 06 Jun 2025 03:12:20 GMT
- Title: FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
- Authors: Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, Bohan Zhuang
- Abstract summary: We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution.
- Score: 44.206702976963676
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.
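To make the "unified 3D tile-wise granularity" idea concrete, here is a toy NumPy sketch of how a single (T, H, W) activation could be partitioned into 3D tiles, with the lowest-magnitude tiles dropped (sparsity) and the kept tiles fake-quantized with a per-tile FP8-style scale. This is an illustrative sketch only, not the paper's kernel: the tile size, keep ratio, and the coarse rounding grid used to mimic low-bit precision are all assumptions, and a real implementation would fuse these steps into the attention kernel and vary the keep ratio per denoising step.

```python
import numpy as np

def tilewise_quant_sparse(x, tile=(4, 8, 8), keep_ratio=0.5, fp8_max=448.0):
    """Toy sketch: partition a (T, H, W) tensor into 3D tiles, zero out the
    lowest-magnitude tiles (sparsity), and fake-quantize the kept tiles with a
    per-tile scale into an FP8 E4M3-like range. Dimensions must divide evenly
    by the tile size. Returns the processed tensor and the number of kept tiles."""
    T, H, W = x.shape
    tt, th, tw = tile
    # View as a grid of tiles: (nT, nH, nW, tt, th, tw)
    g = x.reshape(T // tt, tt, H // th, th, W // tw, tw).transpose(0, 2, 4, 1, 3, 5)
    # One magnitude score per tile, used both for the sparsity mask
    # and for the per-tile quantization scale (the "unified granularity" idea).
    norms = np.abs(g).reshape(g.shape[0], g.shape[1], g.shape[2], -1).max(axis=-1)
    thresh = np.quantile(norms, 1.0 - keep_ratio)
    out = np.zeros_like(g)
    kept = 0
    for idx in np.ndindex(norms.shape):
        if norms[idx] >= thresh:
            scale = max(norms[idx] / fp8_max, 1e-12)
            # Fake-quantize: round onto a coarse grid to mimic low-bit precision
            # (a stand-in for a true FP8 E4M3 codec).
            out[idx] = np.round(g[idx] / scale * 16) / 16 * scale
            kept += 1
    return out.transpose(0, 3, 1, 4, 2, 5).reshape(T, H, W), kept
```

Because the same tile grid drives both the sparsity mask and the quantization scales, the two approximations can be tuned jointly rather than compounding independently, which is the failure mode the abstract attributes to naive training-free combinations.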
Related papers
- LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration [12.183601881545039]
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers pose a significant challenge to their practical deployment. We propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training.
arXiv Detail & Related papers (2026-02-24T02:53:28Z) - Flash-VAED: Plug-and-Play VAE Decoders for Efficient Video Generation [16.210613736589597]
Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. We propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. We show that Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
arXiv Detail & Related papers (2026-02-22T12:43:50Z) - GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation [48.965157828225074]
We propose Guided Progressive Distillation (GPD), a framework that accelerates the diffusion process for fast and high-quality video generation. GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on VBench.
arXiv Detail & Related papers (2026-02-02T08:47:33Z) - SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation [16.34443339642213]
SoulX-FlashTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS.
arXiv Detail & Related papers (2025-12-29T11:18:24Z) - Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation [50.04968365065964]
Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference. We introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP). We also propose Decoupled Foreground Attention (DFA) to further accelerate attention computations.
arXiv Detail & Related papers (2025-08-25T02:58:39Z) - A Lightweight Dual-Mode Optimization for Generative Face Video Coding [26.308480665852052]
Generative Face Video Coding (GFVC) achieves superior rate-distortion performance by leveraging the strong inference capabilities of deep generative models. We propose a lightweight GFVC framework that introduces dual-mode optimization to reduce complexity whilst preserving reconstruction quality. Experimental results demonstrate that the proposed lightweight dual-mode optimization for GFVC can achieve a 90.4% parameter reduction and 88.9% savings compared to the baseline.
arXiv Detail & Related papers (2025-08-19T06:09:28Z) - BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation [27.57431718095974]
We present BLADE, a data-free joint training framework for video inference. We show remarkable efficiency gains across different scales. On models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup.
arXiv Detail & Related papers (2025-08-14T15:58:59Z) - Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention [54.15345846343084]
We propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Part Attention is a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. Experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
arXiv Detail & Related papers (2025-07-23T17:57:16Z) - CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers [72.23291099555459]
Diffusion-based generative models have become dominant generators of high-fidelity images and videos but remain limited by their computationally expensive inference procedures. This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. CHORDS significantly accelerates sampling across diverse large-scale image and video diffusion models, yielding up to a 2.1x speedup with four cores (a 50% improvement over baselines) and a 2.9x speedup with eight cores, all without quality degradation.
arXiv Detail & Related papers (2025-07-21T05:48:47Z) - VMoBA: Mixture-of-Block Attention for Video Diffusion Models [29.183614108287276]
This paper introduces Video Mixture of Block Attention (VMoBA), a novel attention mechanism specifically adapted for Video Diffusion Models (VDMs). Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, VMoBA enhances the original MoBA framework with three key modifications. Experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving a 2.92x FLOPs and 1.48x latency speedup.
arXiv Detail & Related papers (2025-06-30T13:52:31Z) - Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion [16.99620863197586]
Diffusion language models offer parallel token generation and inherent bidirectionality. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. We introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking. For the first time, diffusion language models achieve latency comparable to, and even faster than, the widely adopted autoregressive models.
arXiv Detail & Related papers (2025-05-27T17:39:39Z) - VORTA: Efficient Video Diffusion via Routing Sparse Attention [45.269274789183974]
Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive. We propose VORTA, an acceleration framework with two novel components. It achieves a 1.76x end-to-end speedup without quality loss on VBench.
arXiv Detail & Related papers (2025-05-24T17:46:47Z) - Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism [18.655659400456848]
Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. We propose ParaStep, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. ParaStep achieves end-to-end speedups of up to 3.88x on SVD, 2.43x on CogVideoX-2b, and 6.56x.
arXiv Detail & Related papers (2025-05-20T06:58:40Z) - H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models [97.45170082949552]
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile.
arXiv Detail & Related papers (2025-04-14T17:59:06Z) - Acc3D: Accelerating Single Image to 3D Diffusion Models via Edge Consistency Guided Score Distillation [49.202383675543466]
We present Acc3D to tackle the challenge of accelerating the diffusion process to generate 3D models from single images. To derive high-quality reconstructions through few-step inferences, we emphasize the critical issue of regularizing the learning of the score function in states of random noise.
arXiv Detail & Related papers (2025-03-20T09:18:10Z) - One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z) - Efficiency Meets Fidelity: A Novel Quantization Framework for Stable Diffusion [9.402892455344677]
We propose an efficient quantization framework for Stable Diffusion models (SDM). Our framework simultaneously maintains training-inference consistency and ensures optimization stability. Our method demonstrates superior performance over state-of-the-art approaches with shorter training times.
arXiv Detail & Related papers (2024-12-09T17:00:20Z) - Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for the multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z) - Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge
Devices [90.30316433184414]
We propose a data-model-hardware tri-design framework for high- throughput, low-cost, and high-accuracy MOT on HD video stream.
Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5x latency reduction, 20.9x effective frame rate improvement, 5.83x lower power, and 9.78x better energy efficiency, without much accuracy drop.
arXiv Detail & Related papers (2022-10-16T16:21:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.