VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
- URL: http://arxiv.org/abs/2408.17131v1
- Date: Fri, 30 Aug 2024 09:15:54 GMT
- Title: VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
- Authors: Juncan Deng, Shuaiting Li, Zeyu Wang, Hong Gu, Kedong Xu, Kejie Huang
- Abstract summary: Diffusion Transformer models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation.
Vector quantization (VQ) can decompose model weights into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage.
We propose VQ4DiT, a fast post-training vector quantization method for DiTs. Experiments show that VQ4DiT establishes a new state-of-the-art in model size and performance trade-offs, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
- Score: 7.369445527610879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Transformer models (DiTs) have transitioned the network architecture from traditional UNets to transformers, demonstrating exceptional capabilities in image generation. Although DiTs have been widely applied to high-definition video generation tasks, their large parameter size hinders inference on edge devices. Vector quantization (VQ) can decompose model weights into a codebook and assignments, allowing extreme weight quantization and significantly reducing memory usage. In this paper, we propose VQ4DiT, a fast post-training vector quantization method for DiTs. We found that traditional VQ methods calibrate only the codebook, not the assignments. This leads to weight sub-vectors being incorrectly grouped under the same assignment, which provides inconsistent gradients to the codebook and yields a suboptimal result. To address this challenge, VQ4DiT computes a candidate assignment set for each weight sub-vector based on Euclidean distance and reconstructs the sub-vector as a weighted average over that set. Then, using a zero-data, block-wise calibration method, the optimal assignment is efficiently selected from the set while the codebook is calibrated. VQ4DiT quantizes a DiT XL/2 model on a single NVIDIA A100 GPU within 20 minutes to 5 hours, depending on the quantization settings. Experiments show that VQ4DiT establishes a new state of the art in the trade-off between model size and performance, quantizing weights to 2-bit precision while retaining acceptable image generation quality.
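To make the candidate-assignment idea concrete, here is a minimal sketch of the two steps the abstract describes: collecting the nearest codewords for each weight sub-vector, then reconstructing the sub-vector as a weighted average over that candidate set. All shapes and the softmax-over-negative-distance weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, num_subvectors, num_candidates = 4, 16, 1024, 3
W = rng.standard_normal((num_subvectors, d))   # weight sub-vectors (N, d)
C = rng.standard_normal((K, d))                # codebook (K, d)

# Step 1: Euclidean distance from every sub-vector to every codeword.
dists = np.linalg.norm(W[:, None, :] - C[None, :, :], axis=-1)   # (N, K)

# Step 2: candidate assignment set = the few nearest codewords each.
cand = np.argsort(dists, axis=1)[:, :num_candidates]             # (N, M)

# Step 3: reconstruct each sub-vector as a weighted average of its
# candidates; closer codewords get larger weights (softmax over the
# negative distance -- an assumed weighting, the paper's may differ).
cand_d = np.take_along_axis(dists, cand, axis=1)                 # (N, M)
alpha = np.exp(-cand_d)
alpha /= alpha.sum(axis=1, keepdims=True)
W_hat = np.einsum('nm,nmd->nd', alpha, C[cand])                  # (N, d)
```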
Related papers
- GWQ: Gradient-Aware Weight Quantization for Large Language Models [61.17678373122165]
Gradient-aware weight quantization (GWQ) is the first low-bit weight quantization approach that leverages gradients to localize outliers.
GWQ preferentially retains the weights corresponding to the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format.
On zero-shot tasks, GWQ-quantized models achieve higher accuracy than other quantization methods.
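A minimal sketch of the outlier-splitting idea, assuming the outlier rule is simply "largest gradient magnitude on a calibration batch" and that non-outliers get uniform symmetric quantization; GWQ's actual localization rule and storage format may differ.

```python
import numpy as np

def split_outliers(w, grad, outlier_frac=0.01, n_bits=4):
    """Keep ~1% of weights (flagged by gradient magnitude) in full
    precision; uniformly quantize the rest to n_bits."""
    flat_score = np.abs(grad).ravel()
    k = max(1, int(outlier_frac * flat_score.size))
    outlier_idx = np.argpartition(flat_score, -k)[-k:]

    mask = np.zeros(w.size, dtype=bool)
    mask[outlier_idx] = True
    mask = mask.reshape(w.shape)

    # Uniform symmetric quantization of the non-outlier weights.
    w_rest = np.where(mask, 0.0, w)
    max_abs = np.abs(w_rest).max()
    scale = max_abs / (2 ** (n_bits - 1) - 1) if max_abs > 0 else 1.0
    w_q = np.round(w_rest / scale) * scale

    return np.where(mask, w, w_q)   # outliers stay in full precision
```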
arXiv Detail & Related papers (2024-10-30T11:16:04Z)
- QTIP: Quantization with Trellises and Incoherence Processing [29.917017118524246]
Post-training quantization (PTQ) reduces the memory footprint of LLMs.
Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once.
We introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization.
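For intuition, here is a toy 1-bit-per-weight trellis-coded quantizer: a Viterbi search over a small shift-register trellis picks the bit sequence with the least total squared error. QTIP's actual trellis construction, codebooks, and incoherence processing are considerably more sophisticated; the `levels` table below is an arbitrary assumed example.

```python
import numpy as np

def tcq_quantize(x, levels, num_states=4):
    """Choosing bit b in state s emits levels[s, b] and moves to
    state (2*s + b) % num_states; Viterbi minimizes squared error."""
    T = len(x)
    cost = np.full(num_states, np.inf)
    cost[0] = 0.0                              # start in state 0
    prev = np.zeros((T, num_states), dtype=int)
    bit = np.zeros((T, num_states), dtype=int)
    for t in range(T):
        new_cost = np.full(num_states, np.inf)
        for s in range(num_states):
            if not np.isfinite(cost[s]):
                continue
            for b in (0, 1):
                ns = (2 * s + b) % num_states
                c = cost[s] + (x[t] - levels[s, b]) ** 2
                if c < new_cost[ns]:
                    new_cost[ns], prev[t, ns], bit[t, ns] = c, s, b
        cost = new_cost
    s = int(np.argmin(cost))                   # backtrack from best end state
    bits, xq = np.zeros(T, dtype=int), np.zeros(T)
    for t in range(T - 1, -1, -1):
        bits[t], s = bit[t, s], prev[t, s]     # edge came from state s
        xq[t] = levels[s, bits[t]]
    return bits, xq

# Example with an assumed level table (one row per trellis state).
levels = np.array([[-1.5, -0.5], [0.5, 1.5], [-1.0, 0.0], [0.0, 1.0]])
bits, xq = tcq_quantize(np.array([0.3, -0.7, 1.2, 0.1]), levels)
```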
arXiv Detail & Related papers (2024-06-17T06:03:13Z)
- ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation [23.00085349135532]
Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity.
We find that existing diffusion quantization methods designed for U-Nets face challenges in preserving quality when applied to diffusion transformers.
We improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP).
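As a rough illustration of mixed-precision assignment in general, the sketch below greedily allocates per-layer bit-widths using plain weight-quantization MSE as the sensitivity metric; ViDiT-Q's metric-decoupled criterion and granularity are different, so treat this only as a generic baseline.

```python
import numpy as np

def allocate_bits(layers, budget_bits, choices=(2, 4, 8)):
    """Start every layer at the lowest bit-width, then repeatedly upgrade
    the layer whose MSE drops the most per extra bit until the average
    bit-width budget is reached. `layers` is a list of weight arrays."""
    def mse(w, bits):
        scale = np.abs(w).max() / (2 ** (bits - 1) - 1) + 1e-12
        return np.mean((w - np.round(w / scale) * scale) ** 2)

    bits = [choices[0]] * len(layers)
    while np.mean(bits) < budget_bits:
        gains = []
        for i, w in enumerate(layers):
            nxt = [c for c in choices if c > bits[i]]
            if not nxt:
                gains.append(-np.inf)          # already at max precision
                continue
            gains.append((mse(w, bits[i]) - mse(w, nxt[0])) / (nxt[0] - bits[i]))
        i = int(np.argmax(gains))
        if gains[i] == -np.inf:
            break
        bits[i] = [c for c in choices if c > bits[i]][0]
    return bits
```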
arXiv Detail & Related papers (2024-06-04T17:57:10Z)
- STAT: Shrinking Transformers After Training [72.0726371426711]
We present STAT, a simple algorithm to prune transformer models without any fine-tuning.
STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer.
Our entire algorithm takes minutes to compress BERT, and less than three hours to compress models with 7B parameters using a single GPU.
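A minimal sketch of the "correct the next layer" idea for a pair of linear layers, ignoring nonlinearities and attention structure, both of which STAT handles; the least-squares correction on calibration inputs is an assumed simplification of the paper's procedure.

```python
import numpy as np

def prune_with_correction(W1, W2, X, keep):
    """Drop neurons of W1 and correct W2 by least squares so the composed
    output on calibration inputs X changes as little as possible.
    W1: (h, d), W2: (o, h), X: (n, d); keep: surviving neuron indices."""
    H = X @ W1.T                    # (n, h) full hidden activations
    H_kept = H[:, keep]             # (n, h') activations that survive
    target = H @ W2.T               # (n, o) original layer-pair output
    # Solve H_kept @ W2_new.T ~= target in the least-squares sense.
    W2_new_T, *_ = np.linalg.lstsq(H_kept, target, rcond=None)
    return W1[keep], W2_new_T.T     # pruned layer and corrected next layer
```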
arXiv Detail & Related papers (2024-05-29T22:59:11Z)
- GPTVQ: The Blessing of Dimensionality for LLM Quantization [16.585681547799762]
We show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality.
We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs).
Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE.
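The sketch below captures the interleaving pattern in a simplified form: vector-quantize a group of columns, then push the resulting output error onto the not-yet-quantized columns via least squares on calibration activations. GPTVQ's actual updates use the Hessian of the per-layer output reconstruction MSE, so this is an approximation of the structure, not the published algorithm; it assumes the input width is divisible by `group`.

```python
import numpy as np

def vq_columns_with_feedback(W, X, codebook, group=2):
    """W: (out, in) weights, X: (n, in) calibration activations,
    codebook: (K, group) codewords for row-wise sub-vectors."""
    W = W.copy()
    n_in = W.shape[1]
    for c in range(0, n_in, group):
        sub = W[:, c:c + group]                          # (out, group)
        d = ((sub[:, None, :] - codebook[None]) ** 2).sum(-1)
        Wq = codebook[d.argmin(1)]                       # nearest codeword per row
        err_out = (sub - Wq) @ X[:, c:c + group].T       # (out, n) output error
        W[:, c:c + group] = Wq
        rest = slice(c + group, n_in)
        if rest.start < n_in:
            # Distribute the error onto the remaining columns so that
            # W_new @ X.T stays close to the original product.
            delta, *_ = np.linalg.lstsq(X[:, rest], err_out.T, rcond=None)
            W[:, rest] += delta.T
    return W
```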
arXiv Detail & Related papers (2024-02-23T13:39:16Z)
- Soft Convex Quantization: Revisiting Vector Quantization with Convex Optimization [40.1651740183975]
We propose Soft Convex Quantization (SCQ) as a direct substitute for Vector Quantization (VQ).
SCQ works like a differentiable convex optimization (DCO) layer.
We demonstrate its efficacy on the CIFAR-10, GTSRB and LSUN datasets.
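As a toy stand-in for the interface, the sketch below represents each input as a convex combination of codewords with softmax weights; SCQ itself obtains the combination by solving a convex program inside a differentiable convex optimization (DCO) layer, which this deliberately does not reproduce.

```python
import numpy as np

def soft_convex_quantize(z, codebook, tau=0.1):
    """z: (n, d) inputs, codebook: (K, d). Returns points in the convex
    hull of the codebook, weighted toward nearby codewords."""
    d2 = ((z[:, None, :] - codebook[None]) ** 2).sum(-1)   # (n, K)
    a = np.exp(-d2 / tau)
    a /= a.sum(1, keepdims=True)    # convex weights: nonnegative, sum to 1
    return a @ codebook             # soft-quantized, differentiable output
```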
arXiv Detail & Related papers (2023-10-04T17:45:14Z)
- Improving Convergence for Quantum Variational Classifiers using Weight Re-Mapping [60.086820254217336]
In recent years, quantum machine learning has seen a substantial increase in the use of variational quantum circuits (VQCs).
We introduce weight re-mapping for VQCs, to unambiguously map the weights to an interval of length $2\pi$.
We demonstrate that weight re-mapping increased test accuracy on the Wine dataset by $10\%$ over using unmodified weights.
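A one-line example of such a re-mapping, assuming tanh is the chosen map onto $(-\pi, \pi)$; the paper evaluates several mapping functions, so this is only one plausible instance.

```python
import numpy as np

def remap_weight(w):
    # Map an unbounded trainable weight into (-pi, pi), an interval of
    # length 2*pi, so rotation-gate angles are unambiguous. The choice
    # of tanh is an assumption; other bounded monotone maps work too.
    return np.pi * np.tanh(w)
```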
arXiv Detail & Related papers (2022-12-22T13:23:19Z)
- Vertical Layering of Quantized Neural Networks for Heterogeneous Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
In theory, a network of any precision can be obtained for on-demand serving while training and maintaining only one model.
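A sketch of one way such a vertical-layered representation can be laid out, assuming unsigned integer weights and bit-plane slicing so that a lower-precision model is simply a prefix of the planes; how the paper actually trains the shared model is more involved.

```python
import numpy as np

def bit_layers(w_q, bits=8):
    """Split an unsigned `bits`-bit integer weight tensor into per-bit
    planes, most significant first. w_q: integers in [0, 2**bits)."""
    return [((w_q >> b) & 1) << b for b in range(bits - 1, -1, -1)]

def at_precision(planes, k):
    """Reconstruct the k-most-significant-bit version of the weights by
    summing the top-k planes (truncation, not rounding)."""
    return sum(planes[:k])
```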
arXiv Detail & Related papers (2022-12-10T15:57:38Z)
- Power-of-Two Quantization for Low Bitwidth and Hardware Compliant Neural Networks [1.398698203665363]
In this paper, we explore non-linear quantization techniques for exploiting lower bit precision.
We developed a Quantization-Aware Training (QAT) algorithm that enables the training of low-bit-width Power-of-Two (PoT) networks.
At the same time, PoT quantization vastly reduces the computational complexity of the neural network.
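A minimal PoT quantizer, rounding each weight to the nearest signed power of two in log space so multiplications reduce to bit shifts; the exponent range and zero-snapping rule below are assumed examples rather than the paper's configuration.

```python
import numpy as np

def pot_quantize(w, e_min=-7, e_max=0):
    """Round each weight to the nearest signed power of two, snapping
    values below half the smallest level to zero."""
    sign = np.sign(w)
    mag = np.abs(w)
    zero = mag < 2.0 ** (e_min - 1)
    e = np.clip(np.round(np.log2(np.where(zero, 1.0, mag))), e_min, e_max)
    return np.where(zero, 0.0, sign * 2.0 ** e)
```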
arXiv Detail & Related papers (2022-03-09T19:57:14Z)
- Differentiable Model Compression via Pseudo Quantization Noise [99.89011673907814]
We propose to add independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator.
We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation.
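The core trick fits in a few lines: during training, replace rounding with additive independent uniform noise spanning one quantization step, which is differentiable with respect to the weights. The symmetric max-abs step size below is an assumed example.

```python
import numpy as np

def pseudo_quant_noise(w, n_bits, rng=None):
    """Training-time surrogate for uniform quantization; at inference
    the real round-to-step operator is used instead."""
    if rng is None:
        rng = np.random.default_rng(0)
    step = 2 * np.abs(w).max() / (2 ** n_bits - 1)
    noise = rng.uniform(-step / 2, step / 2, size=w.shape)
    return w + noise   # differentiable in w, unlike np.round
```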
arXiv Detail & Related papers (2021-04-20T14:14:03Z)
- Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks [73.29587731448345]
This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations.
First, to obtain low bit-width weights, most existing methods derive the quantized weights by quantizing the full-precision network weights.
Second, to obtain low bit-width activations, existing works consider all channels equally.
arXiv Detail & Related papers (2020-12-26T15:21:18Z)