PTQ4DiT: Post-training Quantization for Diffusion Transformers
- URL: http://arxiv.org/abs/2405.16005v3
- Date: Thu, 17 Oct 2024 15:28:15 GMT
- Title: PTQ4DiT: Post-training Quantization for Diffusion Transformers
- Authors: Junyi Wu, Haoxuan Wang, Yuzhang Shang, Mubarak Shah, Yan Yan
- Abstract summary: Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint.
We propose PTQ4DiT, a specifically designed PTQ method for DiTs.
We demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision while preserving comparable generation ability.
- Score: 52.902071948957186
- Abstract: The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman's $\rho$-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.
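The idea of redistributing channel extremes between activations and weights can be illustrated with a small sketch. The abstract does not give the CSB formula, so everything here is an assumption made for illustration: salience is taken as the per-channel maximum absolute value, the balanced target is the geometric mean of the activation and weight salience, and the helper names (`balance`, `fake_quantize`, and so on) are hypothetical. It is not the authors' implementation, and the SSC step and offline re-parameterization are not modeled.

```python
import math

def channel_salience_x(X):
    # X is tokens x channels; salience of channel j = max |X[i][j]| over tokens.
    return [max(abs(row[j]) for row in X) for j in range(len(X[0]))]

def channel_salience_w(W):
    # W is channels x outputs; row j multiplies activation channel j.
    return [max(abs(v) for v in row) for row in W]

def balance(X, W):
    # Assumed rule: scale channel j of X and row j of W so both reach the
    # geometric mean of their saliences. The two factors multiply to 1,
    # so the product X @ W is unchanged.
    fx, fw = [], []
    for a, b in zip(channel_salience_x(X), channel_salience_w(W)):
        if a == 0 or b == 0:
            fx.append(1.0)
            fw.append(1.0)
        else:
            target = math.sqrt(a * b)
            fx.append(target / a)
            fw.append(target / b)
    Xb = [[v * fx[j] for j, v in enumerate(row)] for row in X]
    Wb = [[v * fw[j] for v in row] for j, row in enumerate(W)]
    return Xb, Wb

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def fake_quantize(M, bits=8):
    # Symmetric uniform quantization with one scale for the whole matrix,
    # followed by dequantization back to floats.
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in M for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [[round(v / scale) * scale for v in row] for row in M]

def max_abs_error(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Toy data: channel 1 is extreme in X but tiny in W, and vice versa for channel 0.
X = [[0.1, 40.0], [-0.2, -35.0]]
W = [[8.0, -6.0], [0.01, 0.02]]

reference = matmul(X, W)
Xb, Wb = balance(X, W)

err_plain = max_abs_error(reference, matmul(fake_quantize(X), fake_quantize(W)))
err_balanced = max_abs_error(reference, matmul(fake_quantize(Xb), fake_quantize(Wb)))
print("W8A8 error without balancing:", err_plain)
print("W8A8 error with balancing:   ", err_balanced)
```

Because the two scaling factors for each channel cancel, the balanced pair computes the same full-precision output as the original pair; only the value ranges seen by the quantizer change. Whether the quantization error actually drops depends on the data, which is why the sketch prints both errors rather than asserting an improvement.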
Related papers
- DiTAS: Quantizing Diffusion Transformers via Enhanced Activation Smoothing [5.174900115018253]
We propose a data-free post-training quantization (PTQ) method for efficient Diffusion Transformers (DiTs).
DiTAS relies on the proposed temporal-aggregated smoothing techniques to mitigate the impact of the channel-wise outliers within the input activations.
We show that our approach enables 4-bit weight, 8-bit activation (W4A8) quantization for DiTs while maintaining comparable performance as the full-precision model.
arXiv Detail & Related papers (2024-09-12T05:18:57Z) - Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers [45.762142897697366]
Post-Training Quantization (PTQ) emerges as a promising solution, enabling model compression and accelerated inference for pretrained models.
Research on DiT quantization remains sparse, and existing PTQ frameworks tend to suffer from biased quantization, leading to notable performance degradation.
We propose Q-DiT, a novel approach that seamlessly integrates two key techniques: automatic quantization granularity allocation to handle the significant variance of weights and activations across input channels, and sample-wise dynamic activation quantization to adaptively capture activation changes across both timesteps and samples.
arXiv Detail & Related papers (2024-06-25T07:57:27Z) - 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution [83.09117439860607]
Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment.
It is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts.
We present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization.
arXiv Detail & Related papers (2024-06-10T06:06:11Z) - ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation [23.00085349135532]
Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity.
We find that applying existing diffusion quantization methods for U-Net faces challenges in preserving quality.
We improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP).
arXiv Detail & Related papers (2024-06-04T17:57:10Z) - HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization [10.307268005739202]
Diffusion Transformers (DiTs) have recently gained substantial attention for their superior visual generation capabilities.
DiTs also come with high parameter counts and implementation costs, seriously restricting their use on resource-limited devices such as mobile phones.
We introduce the Hybrid Floating-point Quantization for DiT (HQ-DiT), an efficient post-training quantization method that utilizes 4-bit floating-point (FP) precision on both weights and activations for DiT inference.
arXiv Detail & Related papers (2024-05-30T06:56:11Z) - Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization [62.15918574997175]
It is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels.
We propose a strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization.
We show that regularizing both the inputs and outputs is crucial for preventing a model's "migrating" the difficulty in input quantization to the weights.
arXiv Detail & Related papers (2024-04-04T17:25:30Z) - Q-Diffusion: Quantizing Diffusion Models [52.978047249670276]
Post-training quantization (PTQ) is considered a go-to compression method for other tasks.
We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture.
We show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance.
arXiv Detail & Related papers (2023-02-08T19:38:59Z) - RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs).
RepQ-ViT decouples the quantization and inference processes.
It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z) - PAMS: Quantized Super-Resolution via Parameterized Max Scale [84.55675222525608]
Deep convolutional neural networks (DCNNs) have shown dominant performance in the task of super-resolution (SR).
We propose a new quantization scheme termed PArameterized Max Scale (PAMS), which applies the trainable truncated parameter to explore the upper bound of the quantization range adaptively.
Experiments demonstrate that the proposed PAMS scheme can well compress and accelerate the existing SR models such as EDSR and RDN.
arXiv Detail & Related papers (2020-11-09T06:16:05Z)