Related papers: ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

URL: http://arxiv.org/abs/2406.02540v2
Date: Sun, 30 Jun 2024 14:41:22 GMT
Title: ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
Authors: Tianchen Zhao, Tongcheng Fang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang,
Abstract summary: Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. We find that applying existing diffusion quantization methods for U-Net faces challenges in preserving quality. We improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP)
Score: 23.00085349135532
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that applying existing diffusion quantization methods designed for U-Net faces challenges in preserving quality. After analyzing the major challenges for quantizing diffusion transformers, we design an improved quantization scheme: "ViDiT-Q": Video and Image Diffusion Transformer Quantization) to address these issues. Furthermore, we identify highly sensitive layers and timesteps hinder quantization for lower bit-widths. To tackle this, we improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiTQ-MP achieves W4A8 with negligible visual quality degradation, resulting in a 2.5x memory optimization and a 1.5x latency speedup.

Related papers

PassionSR: Post-Training Quantization with Adaptive Scale in One-Step Diffusion based Image Super-Resolution [87.89013794655207]
Diffusion-based image super-resolution (SR) models have shown superior performance at the cost of multiple denoising steps. We propose a novel post-training quantization approach with adaptive scale in one-step diffusion (OSD) image SR, PassionSR. Our PassionSR achieves significant advantages over recent leading low-bit quantization methods for image SR.
arXiv Detail & Related papers (2024-11-26T04:49:42Z)
QVD: Post-training Quantization for Video Diffusion Models [33.13078954859106]
Post-training quantization (PTQ) is an effective technique to reduce memory footprint and improve computational efficiency. We introduce the first PTQ strategy tailored for video diffusion models, dubbed QVD. We achieve near-lossless performance degradation on W8A8, outperforming the current methods by 205.12 in FVD.
arXiv Detail & Related papers (2024-07-16T10:47:27Z)
Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers [45.762142897697366]
Post-Training Quantization (PTQ) emerges as a promising solution, enabling model compression and accelerated inference for pretrained models. Research on DiT quantization remains sparse, and existing PTQ frameworks tend to suffer from biased quantization, leading to notable performance degradation. We propose Q-DiT, a novel approach that seamlessly integrates two key techniques: automatic quantization granularity allocation to handle the significant variance of weights and activations across input channels, and sample-wise dynamic activation quantization to adaptively capture activation changes across both timesteps and samples.
arXiv Detail & Related papers (2024-06-25T07:57:27Z)
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization [16.83403134551842]
Recent few-step diffusion models reduce the inference time by reducing the denoising steps. The Post Training Quantization (PTQ) replaces high bit-width FP representation with low-bit integer values. However, when applying to few-step diffusion models, existing quantization methods face challenges in preserving both the image quality and text alignment.
arXiv Detail & Related papers (2024-05-28T06:50:58Z)
PTQ4DiT: Post-training Quantization for Diffusion Transformers [52.902071948957186]
Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint. We propose PTQ4DiT, a specifically designed PTQ method for DiTs. We demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision while preserving comparable generation ability.
arXiv Detail & Related papers (2024-05-25T02:02:08Z)
Towards Accurate Post-Training Quantization for Vision Transformer [48.779346466374406]
Existing post-training quantization methods still cause severe performance drops. APQ-ViT surpasses the existing post-training quantization methods by convincing margins.
arXiv Detail & Related papers (2023-03-25T03:05:26Z)
Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT Systems [23.261607952479377]
Vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. We propose a new post-training quantization method, which is the first to quantize efficient hybrid ViTs. We achieve a significant improvement of 17.73% for 8-bit and 29.75% for 6-bit on average, compared with existing PTQ methods.
arXiv Detail & Related papers (2023-03-22T13:41:22Z)
RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers [2.114921680609289]
We propose RepQ-ViT, a novel PTQ framework for vision transformers (ViTs) RepQ-ViT decouples the quantization and inference processes. It can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
arXiv Detail & Related papers (2022-12-16T02:52:37Z)
Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution guided distillation scheme for fully quantized vision transformers (Q-ViT) Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Existing approaches usually align and aggregate video frames from limited adjacent frames. We propose a novel Transformer for Video Super-Resolution (TTVSR)
arXiv Detail & Related papers (2022-04-08T03:37:39Z)
FQ-ViT: Fully Quantized Vision Transformer without Retraining [13.82845665713633]
We present a systematic method to reduce the performance degradation and inference complexity of Quantized Transformers. We are the first to achieve comparable accuracy degradation (1%) on fully quantized Vision Transformers.
arXiv Detail & Related papers (2021-11-27T06:20:53Z)
Post-Training Quantization for Vision Transformer [85.57953732941101]
We present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers. We can obtain an 81.29% top-1 accuracy using DeiT-B model on ImageNet dataset with about 8-bit quantization.
arXiv Detail & Related papers (2021-06-27T06:27:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.