ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
- URL: http://arxiv.org/abs/2412.14363v2
- Date: Mon, 03 Feb 2025 21:45:32 GMT
- Title: ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals
- Authors: Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang
- Abstract summary: Post-training quantization of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. We propose ResQ, a PTQ method that pushes the state of the art further. We demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks.
- Score: 10.860081994662645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantizing all weight, activation and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes the state of the art further. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g., 8-bit, while quantizing the rest to 4-bit. Within each subspace, an invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed-precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed-precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method, SpinQuant, and up to 3x speedup over the 16-bit baseline. Code is available at https://github.com/utkarsh-dmx/project-resq.
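The following is a minimal sketch of the mixed-precision recipe the abstract describes: PCA on calibration activations selects the highest-variance subspace (roughly 1/8 of the hidden dimension), coefficients in that subspace are kept at 8-bit, the orthogonal residual subspace is quantized to 4-bit, and a random rotation is applied within each subspace. The helper names (random_orthogonal, fake_quant, resq_style_quant) and the simple per-tensor quantizer are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch

def random_orthogonal(n: int) -> torch.Tensor:
    # Random orthogonal (rotation) matrix via QR decomposition.
    q, _ = torch.linalg.qr(torch.randn(n, n))
    return q

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor fake quantization, for illustration only.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def resq_style_quant(x: torch.Tensor, calib: torch.Tensor, keep_ratio: float = 0.125):
    """x, calib: (tokens, hidden) activations.
    Keep the top-variance PCA subspace (~1/8 of the hidden dim) at 8-bit and
    quantize the orthogonal residual subspace to 4-bit."""
    d = x.shape[-1]
    r = max(1, int(d * keep_ratio))
    # PCA directions = eigenvectors of the calibration covariance matrix.
    cov = torch.cov(calib.T)
    _, eigvecs = torch.linalg.eigh(cov)      # eigenvalues in ascending order
    u_lo, u_hi = eigvecs[:, :-r], eigvecs[:, -r:]
    # Rotate within each subspace to further suppress outliers, then quantize.
    p_hi = u_hi @ random_orthogonal(r)
    p_lo = u_lo @ random_orthogonal(d - r)
    x_hi = fake_quant(x @ p_hi, bits=8)      # high-precision coefficients
    x_lo = fake_quant(x @ p_lo, bits=4)      # low-precision coefficients
    # Map back to the original basis (the two projections are orthogonal).
    return x_hi @ p_hi.T + x_lo @ p_lo.T
```

In an actual deployment the projections would typically be fused into adjacent weight matrices rather than applied at runtime; the sketch only shows where the mixed-precision split and the in-subspace rotations act.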
Related papers
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models [56.504879072674015]
We propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients. BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit).
arXiv Detail & Related papers (2026-02-04T02:54:37Z) - HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning [5.407724832457912]
We propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant.
arXiv Detail & Related papers (2026-01-29T12:27:05Z) - Post-Training Quantization via Residual Truncation and Zero Suppression for Diffusion Models [10.000323762676633]
Diffusion models achieve high-quality image generation but face deployment challenges due to their high computational requirements. We propose Quantization via Residual Truncation and Zero Suppression (QuaRTZ), a 4-bit PTQ scheme for diffusion models. Our approach reduces rounding errors and improves quantization efficiency by balancing outlier preservation and LSB precision.
arXiv Detail & Related papers (2025-09-30T15:55:42Z) - SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights [8.95245917088986]
Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm.
arXiv Detail & Related papers (2025-09-26T21:22:54Z) - Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model [0.0]
Mix-QSAM is a mixed-precision Post-Training Quantization (PTQ) framework for the Segment Anything Model (SAM). We introduce a layer-wise importance score, derived using Kullback-Leibler (KL) divergence, to quantify each layer's contribution to the model's output. We also introduce cross-layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers.
arXiv Detail & Related papers (2025-05-08T00:08:31Z) - KurTail : Kurtosis-based LLM Quantization [51.24081396305435]
KurTail is a new post-training quantization scheme that mitigates outliers in the activations of large language models.
It offers a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot.
It also outperforms SpinQuant with a 2.6% MMLU gain and reduces perplexity by 2.9%, all while reducing the training cost.
arXiv Detail & Related papers (2025-03-03T12:43:06Z) - PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization.
We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time.
Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z) - SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models have been proven highly effective at generating high-quality images.
As these models grow larger, they require significantly more memory and suffer from higher latency.
In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits.
arXiv Detail & Related papers (2024-11-07T18:59:58Z) - QERA: an Analytical Framework for Quantization Error Reconstruction [12.110441045050223]
There is increasing interest in quantizing weights to extremely low precision while offsetting the resulting error with low-rank, high-precision error-reconstruction terms.
The combination of quantization and low-rank approximation is popular both in adapter-based, parameter-efficient fine-tuning methods and in post-training quantization.
We formulate an analytical framework, named Quantization Error Reconstruction Analysis (QERA), and offer a closed-form solution to the problem (a minimal sketch of the underlying low-rank error-reconstruction idea appears after this list).
arXiv Detail & Related papers (2024-10-08T13:37:34Z) - MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization [16.83403134551842]
Recent few-step diffusion models reduce inference time by reducing the number of denoising steps.
Post-Training Quantization (PTQ) replaces the high-bit-width FP representation with low-bit integer values.
However, when applied to few-step diffusion models, existing quantization methods face challenges in preserving both image quality and text alignment.
arXiv Detail & Related papers (2024-05-28T06:50:58Z) - COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization [8.214857267270807]
Post-training quantization (PTQ) has emerged as a practical approach to compress large neural networks.
We propose an innovative PTQ algorithm termed COMQ, which sequentially conducts coordinate-wise minimization of the layer-wise reconstruction errors.
COMQ achieves remarkable results in quantizing Vision Transformers to 4-bit, with a negligible loss of less than 1% in Top-1 accuracy.
arXiv Detail & Related papers (2024-03-11T20:04:03Z) - CBQ: Cross-Block Quantization for Large Language Models [66.82132832702895]
Post-training quantization (PTQ) has played a key role in compressing large language models (LLMs) with ultra-low costs.
We propose CBQ, a cross-block reconstruction-based PTQ method for LLMs.
CBQ employs cross-block dependencies via a reconstruction scheme, establishing long-range dependencies across multiple blocks to minimize error accumulation.
arXiv Detail & Related papers (2023-12-13T07:56:27Z) - Q-ASR: Integer-only Zero-shot Quantization for Efficient Speech Recognition [65.7040645560855]
We propose Q-ASR, an integer-only, zero-shot quantization scheme for ASR models.
We show negligible WER change compared to the full-precision baseline models.
Q-ASR exhibits a large compression rate of more than 4x with small WER degradation.
arXiv Detail & Related papers (2021-03-31T06:05:40Z) - HAWQV3: Dyadic Neural Network Quantization [73.11579145354801]
Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values.
We present HAWQV3, a novel mixed-precision integer-only quantization framework.
arXiv Detail & Related papers (2020-11-20T23:51:43Z)
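As referenced from the QERA entry above, here is a hedged sketch of the generic low-rank error-reconstruction idea that QERA and SVDQuant build on: quantize the weight matrix coarsely, then approximate the residual quantization error with a small high-precision low-rank term obtained via truncated SVD. The helper names and the plain (unweighted) SVD are assumptions for illustration; QERA in particular derives a closed-form, activation-aware solution rather than this naive variant.

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric per-tensor fake quantization, for illustration only.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def quant_with_lowrank_residual(w: torch.Tensor, rank: int = 32, bits: int = 4):
    """Return (w_q, l1, l2) such that w ~= w_q + l1 @ l2,
    with w_q in low precision and l1, l2 kept in high precision."""
    w_q = fake_quant(w, bits)                         # coarse low-bit weights
    err = w - w_q                                     # quantization error
    u, s, vh = torch.linalg.svd(err, full_matrices=False)
    l1 = u[:, :rank] * s[:rank]                       # (out_features, rank)
    l2 = vh[:rank, :]                                 # (rank, in_features)
    return w_q, l1, l2
```

At inference the layer would compute x @ (w_q + l1 @ l2).T, so the extra cost over plain low-bit quantization is one thin matrix product per layer.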