Turning LLM Activations Quantization-Friendly
- URL: http://arxiv.org/abs/2506.01967v1
- Date: Sun, 11 May 2025 17:13:55 GMT
- Title: Turning LLM Activations Quantization-Friendly
- Authors: Patrik Czakó, Gábor Kertész, Sándor Szénási
- Abstract summary: Quantization effectively reduces the serving costs of Large Language Models (LLMs) by speeding up data movement through compressed parameters and enabling faster operations via integer arithmetic. However, activating integer arithmetic requires quantizing both weights and activations, which poses challenges due to the significant outliers in LLMs that increase quantization error. In this work, we investigate these outliers with an emphasis on their effect on layer-wise quantization error, then examine how smoothing and rotation transform the observed values.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Quantization effectively reduces the serving costs of Large Language Models (LLMs) by speeding up data movement through compressed parameters and enabling faster operations via integer arithmetic. However, activating integer arithmetic requires quantizing both weights and activations, which poses challenges due to the significant outliers in LLMs that increase quantization error. In this work, we investigate these outliers with an emphasis on their effect on layer-wise quantization error, then examine how smoothing and rotation transform the observed values. Our primary contributions include introducing a new metric to measure and visualize quantization difficulty based on channel magnitudes, as well as proposing a hybrid approach that applies channel-wise scaling before rotation, supported by a mathematical formulation of its benefits.
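To make the hybrid approach concrete, here is a minimal NumPy sketch (not the authors' implementation) of channel-wise scaling applied before an orthogonal rotation, followed by symmetric quantization. The `alpha` migration parameter, the Hadamard rotation, and all function names are illustrative assumptions; the point is that the scale/rotate pair leaves the layer output mathematically unchanged while flattening the values that actually get quantized.

```python
# Hedged sketch: SmoothQuant-style channel scaling, then a Hadamard rotation,
# then symmetric INT8 quantization of both activations and weights.
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def smooth_scales(X: np.ndarray, W: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-channel scales that migrate activation outliers into the weights."""
    act_max = np.abs(X).max(axis=0)               # per input-channel activation magnitude
    w_max = np.abs(W).max(axis=1)                 # per input-channel weight magnitude
    return (act_max ** alpha) / np.maximum(w_max ** (1 - alpha), 1e-8)

def quantize_sym(T: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor quantization; returns the dequantized tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(T).max() / qmax
    return np.round(T / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64)); X[:, 3] *= 50.0   # activations with one outlier channel
W = rng.normal(size=(64, 64))                     # weights: (in_features, out_features)

s = smooth_scales(X, W)                           # step 1: channel-wise scaling
H = hadamard(64)                                  # step 2: rotation
X_t, W_t = (X / s) @ H, H.T @ (s[:, None] * W)    # exact: X_t @ W_t == X @ W
baseline = np.linalg.norm(quantize_sym(X) @ quantize_sym(W) - X @ W)
hybrid = np.linalg.norm(quantize_sym(X_t) @ quantize_sym(W_t) - X @ W)
print(f"quantization error  plain: {baseline:.2f}  scaled+rotated: {hybrid:.2f}")
```

Because `H` is orthonormal and the per-channel scales cancel, the transformed product equals the original one exactly; only the quantization error of the transformed tensors differs.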
Related papers
- BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models [16.720321201956157]
BASE-Q is a simple yet powerful approach that combines bias correction and asymmetric scaling to reduce rounding and clipping errors. Experiments demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant, respectively.
arXiv Detail & Related papers (2025-05-26T14:22:21Z)
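As a hedged illustration of the two ingredients the BASE-Q summary names, the sketch below applies asymmetric (zero-point) quantization and then folds the expected output shift from weight quantization into a bias-correction term. It is a generic construction for intuition, not BASE-Q's actual algorithm, and the function names are assumptions.

```python
# Generic illustration: asymmetric (affine) quantization plus bias correction
# that removes the mean output shift introduced by weight quantization.
import numpy as np

def quantize_asymmetric(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Affine quantization: a zero-point lets the integer grid cover a skewed range."""
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale            # dequantized values

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))                 # calibration activations
W = rng.normal(size=(32, 16))                  # weights: (in_features, out_features)

W_q = quantize_asymmetric(W)
# Fold the expected output shift E[X] @ (W - W_q) into the bias so the
# quantized layer matches the full-precision layer in expectation.
bias_correction = X.mean(axis=0) @ (W - W_q)
out_fp = X @ W
print("mean error without correction:", np.abs(out_fp - X @ W_q).mean())
print("mean error with correction:   ", np.abs(out_fp - (X @ W_q + bias_correction)).mean())
```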
- Q-MambaIR: Accurate Quantized Mamba for Efficient Image Restoration [34.43633070396096]
State-Space Models (SSMs) have attracted considerable attention in Image Restoration (IR). Q-MambaIR is an accurate, efficient, and flexible Quantized Mamba for IR tasks.
arXiv Detail & Related papers (2025-03-27T20:34:11Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
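RoSTE builds on the generic straight-through-estimator (STE) trick: quantize in the forward pass, but propagate gradients as if quantization were the identity. The toy regression below sketches that trick alone, without RoSTE's rotation or QA-SFT pipeline; the loss, learning rate, and names are assumptions.

```python
# Minimal straight-through-estimator sketch: forward with quantized weights,
# apply the gradient of the quantized weights to the latent full-precision weights.
import numpy as np

def quantize_sym(w: np.ndarray, bits: int = 4) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
y = X @ rng.normal(size=(8, 1))            # toy regression target
W = rng.normal(size=(8, 1)) * 0.1          # latent full-precision weights

for step in range(200):
    W_q = quantize_sym(W)                  # forward pass uses quantized weights
    err = X @ W_q - y
    grad_wq = X.T @ err / len(X)           # gradient w.r.t. the quantized weights
    W -= 0.05 * grad_wq                    # STE: apply it to the latent weights

print("final loss:", float(np.mean((X @ quantize_sym(W) - y) ** 2)))
```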
- FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
- Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale.
Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation.
We propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Smooth and Rotation operations.
The proposed method outperforms the state-of-the-art method on the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z)
- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners [51.32182730502002]
We introduce Singular-value Diagonal Expansion to refine weight distributions to achieve better quantization alignment. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-22T09:45:16Z)
- PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks [4.827161693957252]
Non-quantized elementwise operations dominate the inference cost of low-precision models.
The PikeLPN model addresses these issues by applying quantization to both elementwise operations and multiply-accumulate operations.
arXiv Detail & Related papers (2024-03-29T18:23:34Z)
- Magic for the Age of Quantized DNNs [0.6008132390640294]
We introduce a novel normalization (Layer-Batch Normalization) that is independent of the mini-batch size and incurs no additional cost during inference.
We also quantize activation functions using the same function and apply surrogate gradients to train the model with both quantized weights and the quantized activation functions.
arXiv Detail & Related papers (2024-03-22T07:21:09Z)
- What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation [55.153595212571375]
Quantization is a technique for improving the memory and computational efficiency of large language models (LLMs).
We propose a new perspective on quantization, viewing it as perturbations added to the weights and activations of LLMs.
We conduct experiments with various artificial perturbations to explore their impact on LLM performance.
arXiv Detail & Related papers (2024-03-11T03:42:51Z)
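The perturbation view can be reproduced in a few lines: treat quantization error as additive noise on the weights and measure how the output of a linear layer degrades as the noise grows. The sketch below only illustrates that framing and is not the paper's experimental protocol; the layer, noise scales, and metric are assumptions.

```python
# Hedged sketch of the "quantization as perturbation" framing: inject artificial
# weight noise of increasing magnitude and track the relative output error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                 # activations
W = rng.normal(size=(64, 64))                  # weights of a toy linear layer
ref = X @ W

for noise_scale in (0.001, 0.01, 0.1):
    delta = rng.normal(scale=noise_scale, size=W.shape)   # artificial perturbation
    err = np.linalg.norm(X @ (W + delta) - ref) / np.linalg.norm(ref)
    print(f"relative output error at weight-noise {noise_scale}: {err:.4f}")
```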
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
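For intuition on what quantizing a key/value cache involves, the sketch below stores cached keys as low-bit integer codes with per-token scales and zero-points and dequantizes them on read. This is a generic scheme for illustration only, not WKVQuant's specific method; the 4-bit setting and function names are assumptions.

```python
# Generic KV-cache quantization sketch: per-token asymmetric 4-bit codes,
# dequantized back to float when the cache is read at attention time.
import numpy as np

def quantize_per_token(kv: np.ndarray, bits: int = 4):
    """kv: (tokens, head_dim). Returns integer codes plus per-token scale/zero-point."""
    qmax = 2 ** bits - 1
    lo, hi = kv.min(axis=1, keepdims=True), kv.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / qmax, 1e-8)
    zero = np.round(-lo / scale)
    codes = np.clip(np.round(kv / scale + zero), 0, qmax).astype(np.uint8)
    return codes, scale, zero

def dequantize(codes, scale, zero):
    return (codes.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(512, 128)).astype(np.float32)     # cached keys for 512 tokens
codes, scale, zero = quantize_per_token(keys)
keys_hat = dequantize(codes, scale, zero)
# codes are stored one per byte here; two 4-bit codes could be packed per byte.
print("bytes: fp32", keys.nbytes, "-> uint8 codes", codes.nbytes, "(packable to", codes.nbytes // 2, ")")
print("max reconstruction error:", float(np.abs(keys - keys_hat).max()))
```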
- ApiQ: Finetuning of 2-Bit Quantized Large Language Model [12.328293460903911]
ApiQ is designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs.
It consistently achieves superior finetuning results across various bit-widths.
arXiv Detail & Related papers (2024-02-07T09:36:54Z)
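ApiQ's key idea, per the summary above, is to initialize the LoRA components so they absorb information lost to quantization. The sketch below illustrates one simple way such an initialization can work, using an SVD of the residual between the full-precision and quantized weights; it is an assumption-laden stand-in for intuition, not ApiQ's actual joint initialization procedure.

```python
# Hedged sketch: initialize LoRA factors from the top singular directions of the
# quantization residual, so the adapter starts by recovering what 2-bit
# quantization destroyed. Not ApiQ's algorithm; an SVD-based illustration only.
import numpy as np

def quantize_sym(w: np.ndarray, bits: int = 2) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W_q = quantize_sym(W, bits=2)                  # 2-bit quantized backbone weight

rank = 16
U, S, Vt = np.linalg.svd(W - W_q)              # residual the adapter should recover
B = U[:, :rank] * S[:rank]                     # LoRA "B" factor from top directions
A = Vt[:rank, :]                               # LoRA "A" factor

err_plain = np.linalg.norm(W - W_q)
err_lora = np.linalg.norm(W - (W_q + B @ A))
print(f"weight error: quantized only {err_plain:.1f}, with LoRA init {err_lora:.1f}")
```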
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.