Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers
- URL: http://arxiv.org/abs/2505.22167v1
- Date: Wed, 28 May 2025 09:33:52 GMT
- Title: Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers
- Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Xiangqi Li, Yu Wang, Zhulin An, Libo Huang, Boyu Diao, Zixiang Zhao, Yongjun Xu, Michele Magno
- Abstract summary: We present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$.
- Score: 31.95947876513405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion transformers (DiT) have demonstrated exceptional performance in video generation. However, their large number of parameters and high computational complexity limit their deployment on edge devices. Quantization can reduce storage requirements and accelerate inference by lowering the bit-width of model parameters. Yet, existing quantization methods for image generation models do not generalize well to video generation tasks. We identify two primary challenges: the loss of information during quantization and the misalignment between optimization objectives and the unique requirements of video generation. To address these challenges, we present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. From the optimization perspective, we introduce Temporal Maintenance Distillation (TMD), which preserves the spatiotemporal correlations between frames and enables the optimization of each frame with respect to the overall video context. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$. Code will be available at https://github.com/cantbebetter2/Q-VDiT.
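To make the bit-width reduction described in the abstract concrete, below is a minimal sketch of symmetric uniform fake-quantization of a weight tensor at 3 bits (the "W3" in W3A6). This is a generic post-training quantization baseline under an assumed per-tensor scale, not the paper's TQE/TMD method; the function name is hypothetical.

```python
import torch

def quantize_weights(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    """Generic symmetric uniform fake-quantization (not the paper's TQE)."""
    qmax = 2 ** (n_bits - 1) - 1               # 3-bit signed range: [-4, 3]
    scale = w.abs().max() / qmax               # per-tensor scale, simplest choice
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_int * scale                       # dequantize to expose the error

# The gap between w and its quantized copy is the information loss that
# error-compensation schemes such as TQE aim to reduce.
w = torch.randn(128, 128)
print(f"mean abs error at 3 bits: {(w - quantize_weights(w)).abs().mean():.4f}")
```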
Related papers
- S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation [32.895381997778586]
We propose S$^2$Q-VDiT, a post-training quantization framework for video diffusion models (V-DMs). Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration.
arXiv Detail & Related papers (2025-08-06T02:12:29Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - DVD-Quant: Data-free Video Diffusion Transformers Quantization [98.43940510241768]
Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. We propose DVD-Quant, a novel data-free quantization framework for video DiTs. Our approach integrates three key innovations, including (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction.
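The entry above only names its components, so as a loose illustration of what rotation-based quantization generally does (not DVD-Quant's actual ARQ, whose details are not given here), the sketch below quantizes a rotated weight matrix: an orthogonal rotation spreads outlier channels across all coordinates before uniform quantization, and needs no calibration data.

```python
import torch

def quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Plain symmetric uniform fake-quantization (illustrative helper)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

# For a linear layer, y = W x = (W Q)(Q^T x) for any orthogonal Q, so we can
# quantize the rotated weights W Q instead of W and rotate the input to match.
Q, _ = torch.linalg.qr(torch.randn(256, 256))            # random orthogonal Q
W = torch.randn(256, 256) * torch.logspace(-2, 1, 256)   # outlier-heavy columns
x = torch.randn(256)

err_plain = (W @ x - quantize(W) @ x).abs().mean()
err_rotated = (W @ x - quantize(W @ Q) @ (Q.T @ x)).abs().mean()
print(f"plain: {err_plain:.4f}  rotated: {err_rotated:.4f}")
```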
arXiv Detail & Related papers (2025-05-24T11:56:02Z) - PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement [83.89668902758243]
Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences. We propose Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD).
arXiv Detail & Related papers (2025-05-18T07:10:40Z) - QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation [84.91431271257437]
Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, but they come with significant drawbacks, including increased computational and memory costs. We propose QuantCache, a novel training-free inference acceleration framework.
arXiv Detail & Related papers (2025-03-09T10:31:51Z) - PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution [95.98801201266099]
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. We propose PassionSR, a novel post-training quantization approach with adaptive scale for one-step diffusion (OSD) image SR. Our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR.
arXiv Detail & Related papers (2024-11-26T04:49:42Z) - Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z) - ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation [23.99995355561429]
Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. We introduce ViDiT-Q (Video & Image Diffusion Transformer Quantization), tailored specifically for DiT models. We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models, achieving W8A8 and W4A8 quantization with negligible degradation in visual quality and metrics.
arXiv Detail & Related papers (2024-06-04T17:57:10Z) - ResQ: Residual Quantization for Video Perception [18.491197847596283]
We propose a novel quantization scheme for video networks, coined Residual Quantization.
We extend our model to dynamically adjust the bit-width in proportion to the amount of change in the video, as sketched below.
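As a rough sketch of the idea just described, spending fewer bits on frames that change little, here is a hypothetical residual-quantization loop; the thresholds, bit choices, and names are illustrative assumptions, not ResQ's actual scheme.

```python
import torch

def quantize(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Symmetric uniform fake-quantization (illustrative helper)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def residual_quantize_video(frames: list[torch.Tensor]) -> list[torch.Tensor]:
    """Quantize the first frame, then only the frame-to-frame residuals,
    using a bit-width that grows with the amount of change (made-up thresholds)."""
    out = [quantize(frames[0], n_bits=8)]        # key frame gets the full budget
    for frame in frames[1:]:
        residual = frame - out[-1]
        change = residual.abs().mean().item()    # proxy for motion/scene change
        n_bits = 2 if change < 0.05 else 4 if change < 0.2 else 8
        out.append(out[-1] + quantize(residual, n_bits))
    return out
```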
arXiv Detail & Related papers (2023-08-18T12:41:10Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from only a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.