Identifying Sensitive Weights via Post-quantization Integral
- URL: http://arxiv.org/abs/2503.01901v1
- Date: Fri, 28 Feb 2025 07:04:19 GMT
- Title: Identifying Sensitive Weights via Post-quantization Integral
- Authors: Yuezhou Hu, Weiyu Huang, Zichen Liang, Chang Chen, Jintao Zhang, Jun Zhu, Jianfei Chen,
- Abstract summary: We propose Post-quantization Integral (PQI) to estimate posterior sensitivity in a fine-grained manner. We also propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced improvement of 2.66 perplexity gain on Llama 3.2 1B with QTIP.
- Score: 27.722950830077444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Serving Large Language Models (LLMs) is costly. However, post-training weight quantization can address this problem by both compressing model sizes for limited memory and saving bandwidth for acceleration. As not all weight dimensions are equally important, those methods typically rely on a sensitivity metric, which indicates the element-wise influence of weights on the loss function and is used to preprocess original weights for better quantization. In this work, we conduct an empirical study on the accuracy of the sensitivity metric, and find that existing gradient- and Hessian-based metrics are very inaccurate: they underestimate quantization's impact on the loss function by orders of magnitude, mainly due to the small convergence radius of the local 2nd-order approximation, i.e., the gradient and Hessian terms in Taylor's formula. To tackle this problem, we propose Post-quantization Integral (PQI), an accurate metric that estimates posterior sensitivity in a fine-grained manner. To leverage this accurate metric, we further propose ReQuant, a simple yet powerful framework that mainly consists of two Dense-and-Sparse detach components: self-adaptive outlier selection and step-wise significant weights detach. Results show that ReQuant boosts state-of-the-art post-training quantization methods, with a pronounced improvement of 2.66 perplexity gain on Llama 3.2 1B with QTIP.
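A toy numerical sketch of the gap the abstract describes (the loss function, the diagonal-Hessian sensitivity, and the path-integral reading of PQI below are illustrative assumptions, not taken from the paper): a local 2nd-order Taylor estimate of the loss change under coarse quantization is compared against the true change and against a numerical integral of the gradient along the straight path from the original to the quantized weights.

```python
# Toy sketch (not the paper's code): a 2nd-order Taylor estimate of the
# loss change under coarse quantization vs. the true change and vs. a
# gradient path integral over the same displacement. The loss below is a
# stand-in; PQI's exact formulation may differ from this reading.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)                      # "pre-trained" weights

def loss(v):
    # Non-quadratic toy loss; higher-order terms matter far from w.
    return np.sum(np.log1p(v ** 2)) + 0.1 * np.sum(v ** 4)

def grad(v):
    return 2 * v / (1 + v ** 2) + 0.4 * v ** 3

def hessian_diag(v):
    return 2 * (1 - v ** 2) / (1 + v ** 2) ** 2 + 1.2 * v ** 2

w_q = np.round(w)                            # simulate a very coarse quantization grid
dw = w_q - w

true_delta = loss(w_q) - loss(w)

# Local estimate used by typical sensitivity metrics: g^T dw + 0.5 dw^T diag(H) dw
taylor_delta = grad(w) @ dw + 0.5 * np.sum(hessian_diag(w) * dw ** 2)

# Path integral of the gradient from w to w_q (trapezoid rule); by the
# fundamental theorem of calculus this recovers the true change.
ts = np.linspace(0.0, 1.0, 65)
vals = np.array([grad(w + t * dw) @ dw for t in ts])
h = ts[1] - ts[0]
integral_delta = h * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

print(f"true     delta-loss: {true_delta:+.4f}")
print(f"taylor   delta-loss: {taylor_delta:+.4f}")
print(f"integral delta-loss: {integral_delta:+.4f}")
```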
Related papers
- D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs [33.883527341335856]
Weight-only post-training quantization (PTQ) is appealing as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision. We propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives.
arXiv Detail & Related papers (2026-01-30T05:49:48Z) - ECO: Quantized Training without Full-Precision Master Weights [58.97082407934466]
Error-Compensating (ECO) training eliminates master weights by applying updates directly to quantized parameters. We show that ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate.
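A minimal sketch of the general error-compensation idea this summary points at (the quantizer, learning rate, and update rule below are illustrative assumptions; ECO's actual algorithm may differ): only quantized weights are stored, and the rounding error of each step is carried into the next one.

```python
# Generic error-feedback sketch of training without full-precision master
# weights: updates are applied directly to quantized parameters, and the
# rounding error is remembered and folded into the next step.
import numpy as np

def quantize(x, step=1 / 64):
    """Uniform round-to-nearest onto a fixed grid (stand-in for a low-bit format)."""
    return np.round(x / step) * step

rng = np.random.default_rng(0)
w_q = quantize(rng.normal(size=8))   # only the quantized weights are stored
residual = np.zeros_like(w_q)        # carried compensation term

def sgd_step(w_q, residual, grad, lr=0.05):
    # Fold the previous rounding error into this step's update,
    # then re-quantize and remember the new rounding error.
    target = w_q - lr * grad + residual
    new_w_q = quantize(target)
    new_residual = target - new_w_q
    return new_w_q, new_residual

# Toy objective: drive the weights toward zero.
for _ in range(100):
    g = w_q.copy()                   # gradient of 0.5 * ||w||^2
    w_q, residual = sgd_step(w_q, residual, g)

print("final quantized weights:", w_q)
```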
arXiv Detail & Related papers (2026-01-29T18:35:01Z) - LoTA-QAF: Lossless Ternary Adaptation for Quantization-Aware Fine-Tuning [50.89500210372827]
Quantization and fine-tuning are crucial for deploying large language models (LLMs) on resource-constrained edge devices. LoTA-QAF is a novel fine-tuning method specifically designed for quantized LLMs. On the MMLU benchmark, our method effectively recovers performance for quantized models, surpassing 16-bit LoRA by up to 5.14%.
arXiv Detail & Related papers (2025-05-24T14:47:28Z) - GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance [21.134233954419148]
Post-training quantization is a key technique for reducing the memory and inference latency of large language models. We propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization.
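As a hedged reading of "integrating gradient information from the end loss into the quantization objective" (one common formulation, not necessarily GuidedQuant's exact objective), a layer-wise reconstruction error can be weighted by squared end-loss gradients collected on calibration data:

$$\min_{\widehat{W}}\;\sum_{n,i}\left(\frac{\partial \mathcal{L}}{\partial z_{n,i}}\right)^{2}\left(z_{n,i}(W)-z_{n,i}(\widehat{W})\right)^{2},$$

where $z_{n,i}$ is the $i$-th output of the layer on calibration sample $n$ and $\mathcal{L}$ is the end loss.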
arXiv Detail & Related papers (2025-05-11T14:55:09Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [65.37942405146232]
We present SOLO, a novel optimizer design that carries extremely lightweight state elements, achieved through ultra-low-precision quantization.
The proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss.
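A minimal sketch of the low-bit optimizer-state idea behind this line of work (the 4-bit state format and SGD-with-momentum update are assumptions for illustration; SOLO's actual design may differ): the EMA buffer is stored as low-bit codes plus a scale and is dequantized, updated, and re-quantized every step.

```python
# Sketch of keeping an optimizer's EMA state in ultra-low precision: the
# momentum buffer lives as int4-style codes plus a per-tensor scale.
import numpy as np

def quantize_state(x, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax if np.any(x) else 1.0
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize_state(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
m_codes, m_scale = quantize_state(np.zeros_like(w))   # low-bit EMA buffer

beta, lr = 0.9, 0.05
for _ in range(50):
    grad = w.copy()                          # toy gradient of 0.5 * ||w||^2
    m = dequantize_state(m_codes, m_scale)   # expand state to float
    m = beta * m + (1 - beta) * grad         # EMA update
    w -= lr * m
    m_codes, m_scale = quantize_state(m)     # shrink state back to low bits

print("final weight norm:", float(np.linalg.norm(w)))
```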
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - Compression Scaling Laws:Unifying Sparsity and Quantization [65.05818215339498]
We investigate how different compression techniques affect the scaling behavior of large language models (LLMs) during pretraining. We show that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework.
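One hedged way to write the "parameter efficiency multiplier" framing, assuming a Chinchilla-style parametric loss (the paper's exact functional form is not given in this summary): a compression format contributes a factor $m$ so that a compressed model with $N$ parameters behaves like a dense model with $mN$ parameters,

$$L(N, D)\;\approx\;E+\frac{A}{(m N)^{\alpha}}+\frac{B}{D^{\beta}},$$

where $D$ is the number of training tokens; the summary's findings then read as weight-only quantization keeping $m$ high, while aggressive weight-and-activation quantization drives it down at low bitwidths.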
arXiv Detail & Related papers (2025-02-23T04:47:36Z) - CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution [59.91470739501034]
We propose CondiQuant, a condition number based low-bit post-training quantization for image super-resolution. We show that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead.
arXiv Detail & Related papers (2025-02-21T14:04:30Z) - PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models [64.84734437930362]
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. We propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
arXiv Detail & Related papers (2025-02-18T08:04:58Z) - Pyramid Vector Quantization for LLMs [8.779688608449902]
This work applies Pyramid Vector Quantization (PVQ) to large language models. PVQ uses a fixed integer lattice on the sphere, obtained by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory. We achieve state-of-the-art quantization performance with a Pareto-optimal trade-off between performance and bits per weight and bits per activation, compared to existing methods.
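A minimal sketch of classic pyramid vector quantization (the greedy pulse allocation and the separately handled gain follow the standard PVQ construction and are not taken from this paper): every codeword is an integer vector with a fixed L1 norm $K$, so no codebook has to be stored.

```python
# Sketch of pyramid vector quantization: map a vector to the nearest scaled
# integer point with sum(|y|) == K, i.e. a lattice point on the L1 sphere.
import numpy as np

def pvq_quantize(x, K):
    """Map x to an integer vector y with sum(|y|) == K that follows x's direction."""
    x = np.asarray(x, dtype=np.float64)
    y = np.zeros(x.shape, dtype=np.int64)
    l1 = float(np.sum(np.abs(x)))
    if l1 == 0.0:
        y[0] = K
        return y
    t = K * x / l1                       # scale x onto the L1 sphere of radius K
    y = np.trunc(t).astype(np.int64)     # integer parts (toward zero)
    while np.sum(np.abs(y)) < K:         # hand out the remaining unit pulses
        gap = np.abs(t) - np.abs(y)      # how far each coordinate still falls short
        i = int(np.argmax(gap))
        y[i] += 1 if t[i] >= 0 else -1
    return y

def pvq_dequantize(y, gain):
    """Rescale the lattice point back to a real vector with the given L2 gain."""
    y = np.asarray(y, dtype=np.float64)
    return gain * y / max(float(np.linalg.norm(y)), 1e-12)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = pvq_quantize(x, K=12)
x_hat = pvq_dequantize(y, gain=float(np.linalg.norm(x)))
print("codeword:", y, "| L1 norm:", int(np.sum(np.abs(y))))
print("cosine similarity:", float(x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))))
```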
arXiv Detail & Related papers (2024-10-22T11:57:32Z) - FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
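A minimal sketch of the affine-transformation idea (the random orthogonal transform, 4-bit per-tensor quantizer, and toy outlier channel are assumptions for illustration; FlatQuant parameterizes and calibrates its transforms differently): inserting an invertible matrix $A$ between activations and weights leaves the layer output unchanged, and the transformed pair is what actually gets quantized.

```python
# For a linear layer y = x @ W, an invertible A can be inserted as
# y = (x @ A) @ (inv(A) @ W) without changing the output; quantizing the
# transformed pair is often easier when activations contain outliers.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # calibration activations
x[:, 0] *= 25.0                      # an outlier channel that dominates the per-tensor scale
W = rng.normal(size=(8, 8))

def quantize_sym(t, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t)) / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

# Toy transform: a random orthogonal matrix (its inverse is its transpose),
# which tends to spread the outlier channel across all dimensions.
A, _ = np.linalg.qr(rng.normal(size=(8, 8)))

y_ref = x @ W
y_naive = quantize_sym(x) @ quantize_sym(W)
y_flat = quantize_sym(x @ A) @ quantize_sym(A.T @ W)   # (x A)(A^-1 W) == x W

print("quantization error, no transform:  ", float(np.linalg.norm(y_ref - y_naive)))
print("quantization error, with transform:", float(np.linalg.norm(y_ref - y_flat)))
```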
arXiv Detail & Related papers (2024-10-12T08:10:28Z) - Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z) - Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction [48.740630807085566]
Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities. Current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially.
arXiv Detail & Related papers (2024-07-09T12:06:03Z) - OAC: Output-adaptive Calibration for Accurate Post-training Quantization [30.115888331426515]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs).
Most PTQ approaches formulate the quantization error based on a calibrated layer-wise $\ell_2$ loss.
We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
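For reference, the layer-wise calibration objective referred to above is typically of the form on the left, with $X$ the calibration activations; the reweighting on the right is only a hedged placeholder for how model-output information could enter, since this summary does not spell out OAC's exact formulation:

$$\min_{\widehat{W}}\;\big\|XW-X\widehat{W}\big\|_F^{2}\qquad\longrightarrow\qquad\min_{\widehat{W}}\;\big\|D\,\big(XW-X\widehat{W}\big)\big\|_F^{2}.$$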
arXiv Detail & Related papers (2024-05-23T20:01:17Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
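A minimal sketch of salience-driven group-wise bit allocation (the salience proxy, group size, and two-level 2/4-bit split are illustrative assumptions; SliM-LLM's criteria differ in detail): groups with higher estimated salience receive more bits under a fixed average-bit budget.

```python
# Toy group-wise mixed-precision allocation driven by a salience score.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256,))               # one output channel's weights
act_scale = np.abs(rng.normal(size=256))  # per-input activation magnitude proxy

group_size = 32
groups = W.reshape(-1, group_size)
scales = act_scale.reshape(-1, group_size)

# Salience proxy: weight magnitude times activation magnitude, summed per group.
salience = np.sum(np.abs(groups) * scales, axis=1)

# Allocate bits: average budget 3 bits/group, top-salience groups get 4, the rest get 2.
budget_avg, n_groups = 3, salience.shape[0]
order = np.argsort(-salience)
bits = np.full(n_groups, 2)
n_high = (budget_avg - 2) * n_groups // (4 - 2)   # how many groups can be 4-bit
bits[order[:n_high]] = 4

print("group bit-widths:", bits)
print("average bits per group:", bits.mean())
```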
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - A2Q+: Improving Accumulator-Aware Weight Quantization [45.14832807541816]
Quantization techniques commonly reduce the inference costs of neural networks by restricting the precision of weights and activations.
Recent work proposed accumulator-aware quantization (A2Q), a quantization-aware training method that constrains model weights during training to safely use a target accumulator bit width during inference.
We introduce A2Q+, a new strategy for initializing quantized weights from pre-trained floating-point checkpoints.
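A minimal sketch of the accumulator-overflow reasoning behind accumulator-aware quantization (a sufficient condition only; A2Q/A2Q+'s actual constraint and the new weight initialization are more refined): with unsigned N-bit inputs and integer weights q, bounding the L1 norm of each weight vector guarantees a P-bit signed accumulator cannot overflow.

```python
# Worst-case dot-product magnitude is (2^N - 1) * ||q||_1 for unsigned N-bit
# inputs, so keeping ||q||_1 under the budget below is sufficient to avoid
# overflow of a P-bit signed accumulator.
import numpy as np

def l1_budget(input_bits, acc_bits):
    """Largest allowed ||q||_1 so a signed acc_bits accumulator never overflows."""
    max_input = 2 ** input_bits - 1
    max_acc = 2 ** (acc_bits - 1) - 1
    return max_acc // max_input

def fits_accumulator(q, input_bits, acc_bits):
    return np.sum(np.abs(q)) <= l1_budget(input_bits, acc_bits)

q = np.array([7, -5, 3, 0, -2, 6, -7, 1])     # int4-style integer weights
print("L1 budget for 8-bit inputs, 16-bit accumulator:", l1_budget(8, 16))
print("these weights fit:", fits_accumulator(q, input_bits=8, acc_bits=16))
```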
arXiv Detail & Related papers (2024-01-19T00:27:34Z) - Mixed Precision Quantization of Transformer Language Models for
Speech Recognition [67.95996816744251]
State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications.
Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of the system to quantization errors.
The optimal local precision settings are automatically learned using two techniques.
Experiments are conducted on the Penn Treebank (PTB) corpus and a Switchboard corpus-trained LF-MMI TDNN system.
arXiv Detail & Related papers (2021-11-29T09:57:00Z) - Post-training Quantization with Multiple Points: Mixed Precision without
Mixed Precision [20.081543082708688]
We propose multipoint quantization, a method that approximates a full-precision weight vector using a linear combination of multiple vectors of low-bit numbers.
We show that our method outperforms a range of state-of-the-art methods on ImageNet classification and it can be generalized to more challenging tasks like PASCAL VOC object detection.
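A minimal sketch of the multipoint idea (the greedy residual construction and ternary inner quantizer are illustrative assumptions; the paper's exact procedure may differ): the full-precision vector is approximated by a sum of scaled low-bit vectors, each fitted to the remaining residual.

```python
# Approximate w as a linear combination of low-bit vectors, built greedily.
import numpy as np

def quantize_lowbit(v, bits=2):
    """Symmetric round-to-nearest onto a low-bit grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(v)) / qmax if np.max(np.abs(v)) > 0 else 1.0
    return np.clip(np.round(v / scale), -qmax, qmax) * scale

def multipoint_approx(w, n_points=3, bits=2):
    """Return (coefficient, low-bit vector) pairs whose weighted sum approximates w."""
    residual = w.copy()
    parts = []
    for _ in range(n_points):
        q = quantize_lowbit(residual, bits)
        # Least-squares optimal scalar coefficient for this direction.
        a = float(residual @ q) / float(q @ q) if np.any(q) else 0.0
        parts.append((a, q))
        residual = residual - a * q
    return parts

rng = np.random.default_rng(0)
w = rng.normal(size=64)
parts = multipoint_approx(w)
w_hat = sum(a * q for a, q in parts)
for k, (a, _) in enumerate(parts):
    print(f"term {k}: coefficient {a:+.3f}")
print("relative error:", float(np.linalg.norm(w - w_hat) / np.linalg.norm(w)))
```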
arXiv Detail & Related papers (2020-02-20T22:37:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.