DVD-Quant: Data-free Video Diffusion Transformers Quantization
- URL: http://arxiv.org/abs/2505.18663v1
- Date: Sat, 24 May 2025 11:56:02 GMT
- Title: DVD-Quant: Data-free Video Diffusion Transformers Quantization
- Authors: Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
- Abstract summary: Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. We propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation.
- Score: 98.43940510241768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art architecture for video generation, yet their computational and memory demands hinder practical deployment. While post-training quantization (PTQ) presents a promising approach to accelerate Video DiT models, existing methods suffer from two critical limitations: (1) dependence on lengthy, computation-heavy calibration procedures, and (2) considerable performance deterioration after quantization. To address these challenges, we propose DVD-Quant, a novel Data-free quantization framework for Video DiTs. Our approach integrates three key innovations: (1) Progressive Bounded Quantization (PBQ) and (2) Auto-scaling Rotated Quantization (ARQ) for calibration data-free quantization error reduction, as well as (3) $\delta$-Guided Bit Switching ($\delta$-GBS) for adaptive bit-width allocation. Extensive experiments across multiple video generation benchmarks demonstrate that DVD-Quant achieves an approximately 2$\times$ speedup over full-precision baselines on HunyuanVideo while maintaining visual fidelity. Notably, DVD-Quant is the first to enable W4A4 PTQ for Video DiTs without compromising video quality. Code and models will be available at https://github.com/lhxcs/DVD-Quant.
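The abstract names its components (PBQ, ARQ, $\delta$-GBS) without spelling them out, but the rotation trick behind methods like ARQ is well documented in the quantization literature: multiply weights (and, symmetrically, activations) by an orthogonal matrix such as a Hadamard matrix so that outlier channels are spread out before low-bit rounding. The NumPy sketch below illustrates only that generic idea; the function names, the per-tensor scaling rule, and the INT4 setting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of 2)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_symmetric(x: np.ndarray, bits: int = 4):
    """Per-tensor symmetric uniform quantization; returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4
    scale = np.abs(x).max() / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q, scale

# Toy demonstration: an outlier channel inflates the per-tensor scale, and
# rotating the weights first spreads that energy across all channels.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[:, 3] *= 25.0                                   # inject an outlier input channel

H = hadamard(64)
q_plain, s_plain = quantize_symmetric(W)
q_rot, s_rot = quantize_symmetric(W @ H)          # quantize in the rotated basis

err_plain = np.linalg.norm(W - q_plain * s_plain)
err_rot = np.linalg.norm(W - (q_rot * s_rot) @ H.T)   # H is orthogonal: H @ H.T = I
print(f"INT4 error without rotation: {err_plain:.2f}, with rotation: {err_rot:.2f}")
```

In a real pipeline the rotation is typically folded into adjacent linear layers so it adds no runtime cost; how DVD-Quant auto-scales this rotation, and how PBQ and $\delta$-GBS operate, is specific to the paper.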
Related papers
- S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation [32.895381997778586]
We propose S$^2$Q-VDiT, a post-training quantization framework for video diffusion models (V-DMs). Under W4A6 quantization, S$^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration.
arXiv Detail & Related papers (2025-08-06T02:12:29Z) - Q-VDiT: Towards Accurate Quantization and Distillation of Video-Generation Diffusion Transformers [31.95947876513405]
We present Q-VDiT, a quantization framework specifically designed for video DiT models. From the quantization perspective, we propose the Token-aware Quantization Estimator (TQE), which compensates for quantization errors in both the token and feature dimensions. Our W3A6 Q-VDiT achieves a scene consistency of 23.40, setting a new benchmark and outperforming current state-of-the-art quantization methods by 1.9$\times$.
arXiv Detail & Related papers (2025-05-28T09:33:52Z) - QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design [54.38970077613728]
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. We propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications.
arXiv Detail & Related papers (2025-05-22T03:26:50Z) - PMQ-VE: Progressive Multi-Frame Quantization for Video Enhancement [83.89668902758243]
Multi-frame video enhancement tasks aim to improve the spatial and temporal resolution and quality of video sequences. We propose Progressive Multi-Frame Quantization for Video Enhancement (PMQ-VE). This framework features a coarse-to-fine two-stage process: Backtracking-based Multi-Frame Quantization (BMFQ) and Progressive Multi-Teacher Distillation (PMTD).
arXiv Detail & Related papers (2025-05-18T07:10:40Z) - Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model [133.01510927611452]
We present Step-Video-T2V, a text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality.
arXiv Detail & Related papers (2025-02-14T15:58:10Z) - Adaptive Caching for Faster Video Generation with Diffusion Transformers [52.73348147077075]
Diffusion Transformers (DiTs) rely on larger models and heavier attention mechanisms, resulting in slower inference speeds.
We introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache)
We also introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, controlling the compute allocation based on motion content. (A generic sketch of this caching idea appears after this list.)
arXiv Detail & Related papers (2024-11-04T18:59:44Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [118.72266141321647]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes. Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - Image and Video Tokenization with Binary Spherical Quantization [36.850958591333836]
We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ)
BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization.
Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. (A minimal sketch of the BSQ step appears after this list.)
arXiv Detail & Related papers (2024-06-11T17:59:53Z) - ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation [23.99995355561429]
Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. We introduce ViDiT-Q (Video & Image Diffusion Transformer Quantization), tailored specifically for DiT models. We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models, achieving W8A8 and W4A8 with negligible degradation in visual quality and metrics.
arXiv Detail & Related papers (2024-06-04T17:57:10Z)
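For the Adaptive Caching (AdaCache) entry above, the summary only says the method is training-free and modulated by motion content; its actual distance metric and schedule are not given. The PyTorch sketch below shows the generic step-wise idea of reusing a DiT block's output across denoising steps when its input has barely moved; `CachedBlock`, the relative-L2 test, and `threshold` are hypothetical.

```python
import torch

class CachedBlock(torch.nn.Module):
    """Reuse a DiT block's output across denoising steps when its input has
    barely changed. A hypothetical sketch of training-free caching; the
    relative-L2 criterion and `threshold` are not AdaCache's actual rule."""

    def __init__(self, block: torch.nn.Module, threshold: float = 0.05):
        super().__init__()
        self.block = block
        self.threshold = threshold
        self.prev_in = None
        self.prev_out = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.prev_in is not None:
            drift = (x - self.prev_in).norm() / (self.prev_in.norm() + 1e-8)
            if drift < self.threshold:   # input close to the cached one:
                return self.prev_out     # skip the block entirely
        out = self.block(x)
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out

# Usage: wrap each transformer block before the sampling loop. A MoReg-style
# scheme could shrink `threshold` for high-motion content (assumption).
block = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.GELU())
cached = CachedBlock(block)
x = torch.randn(1, 16, 128)
for step in range(4):
    y = cached(x)                        # later steps mostly hit the cache
    x = x + 1e-3 * torch.randn_like(x)   # small drift between steps
```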
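The Binary Spherical Quantization entry above is concrete enough to sketch directly: project the embedding to a lower dimension, normalize it onto the unit hypersphere, and binarize each coordinate. The snippet below follows that description at inference time only; the random projection, the 16-dimensional code size, and the omission of the straight-through estimator used in training are assumptions.

```python
import torch

def bsq_quantize(z: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Binary Spherical Quantization as the abstract describes it: project a
    high-dimensional embedding down, put it on the unit hypersphere, then
    snap each coordinate to +/- 1/sqrt(d) (an implicit codebook of 2^d corners)."""
    v = z @ proj                                       # (..., d) projection
    v = v / v.norm(dim=-1, keepdim=True)               # onto the unit hypersphere
    codes = torch.where(v >= 0, torch.ones_like(v), -torch.ones_like(v))
    return codes / proj.shape[1] ** 0.5                # quantized point, unit norm

# Toy usage: 256-dim embeddings -> 16-bit spherical codes.
torch.manual_seed(0)
proj = torch.randn(256, 16) / 256 ** 0.5               # assumed learned in practice
z = torch.randn(4, 256)
zq = bsq_quantize(z, proj)
print(zq.norm(dim=-1))                                 # each code lies on the sphere
```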