Related papers: QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

URL: http://arxiv.org/abs/2509.23681v2
Date: Tue, 30 Sep 2025 03:14:18 GMT
Title: QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification
Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu,
Abstract summary: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment.<n>Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression.<n>We propose textbfQuantSparse, a unified framework that integrates model quantization with attention sparsification.
Score: 67.15451442018258
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose \textbf{QuantSparse}, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce \textit{Multi-Scale Salient Attention Distillation}, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop \textit{Second-Order Sparse Attention Reparameterization}, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a \textbf{3.68$\times$} reduction in storage and \textbf{1.88$\times$} acceleration in end-to-end inference. Our code will be released in https://github.com/wlfeng0509/QuantSparse.

Related papers

Hybrid Predictive Quantum Feedback: Extending Qubit Lifetimes Beyond the Wiseman-Milburn Limit [0.0]
Amplitude damping fundamentally limits qubit lifetimes by irreversibly leaking energy and information into the environment.<n>Standard Wiseman--Milburn feedback offers only modest improvement because it acts on a single measured quadrature and its corrective drive is degraded by loop delay.<n>We introduce a compact hybrid upgrade with two components: (i) a coherently coupled emphancilla qubit that receives the homodyne current and feeds back emphquantum-coherently on the system, recovering information from emphboth field quadratures and intentionally engineered to decay
arXiv Detail & Related papers (2025-11-15T09:45:47Z)
OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation.<n>The resulting discrete tokenization shortens the training sequence by 6.8$times$, and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z)
S$^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation [32.895381997778586]
We propose S$2$Q-VDiT, a post-training quantization framework for video diffusion models (V-DMs)<n>Under W4A6 quantization, S$2$Q-VDiT achieves lossless performance while delivering $3.9times$ model compression and $1.3times$ inference acceleration.
arXiv Detail & Related papers (2025-08-06T02:12:29Z)
MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved textbfMixed textbfPrecision textbfQuantization framework for extremely low-bit textbfDiffusion textbfModels.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
SQuat: Subspace-orthogonal KV Cache Quantization [19.131705063324883]
We introduce SQuat (Subspace-orthogonal KV cache quantization), which reduces peak memory by 2.17 to 2.82, improves throughput by 2.45 to 3.60, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.<n>We show that our method reduces peak memory by 2.17 to 2.82, improves throughput by 2.45 to 3.60, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
arXiv Detail & Related papers (2025-03-31T17:37:32Z)
MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models [37.061975191553]
This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models.<n>To mitigate the quantization error caused by outlier severe weight channels, we propose an Outlier-Driven Mixed Quantization technique.<n>To robustly learn representations crossing time steps, we construct a Time-Smoothed Relation Distillation scheme.
arXiv Detail & Related papers (2024-12-16T08:31:55Z)
Continuous Speculative Decoding for Autoregressive Image Generation [27.308442169466975]
Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation.<n> speculative decoding has effectively accelerated discrete autoregressive inference.<n>This work addresses challenges from low acceptance rate, inconsistent output distribution, and modified distribution without analytic expression.
arXiv Detail & Related papers (2024-11-18T09:19:15Z)
UNComp: Can Matrix Entropy Uncover Sparsity? -- A Compressor Design from an Uncertainty-Aware Perspective [85.08718140718707]
UNComp is an uncertainty-aware framework that uncovers sparsity patterns that can be used for adaptive compression.<n>By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4x.
arXiv Detail & Related papers (2024-10-04T02:32:36Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we identify imbalanced activation distributions as a primary source of quantization difficulty.<n>We propose to adjust these distributions through weight finetuning to be more quantization-friendly.<n>Our method demonstrates its efficacy across three high-resolution image generation tasks.
arXiv Detail & Related papers (2024-02-06T03:39:44Z)
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions. We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.