SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs
- URL: http://arxiv.org/abs/2506.05413v2
- Date: Tue, 29 Jul 2025 15:28:07 GMT
- Title: SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs
- Authors: Patrik Czakó, Gábor Kertész, Sándor Szénási
- Abstract summary: We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.
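To make the idea concrete, the Python sketch below shows how a per-channel smoothing scale and an orthonormal Hadamard rotation can be folded into a linear layer offline, so that a round-to-nearest 4-bit quantizer sees flatter activations at no extra inference cost. The shapes, the scale formula, the quantizer, and all names below are illustrative assumptions of ours, not the released SmoothRot implementation.

```python
# Hedged sketch of a scale-then-rotate transform: per-channel scaling followed by a
# Hadamard rotation, folded into a linear layer offline. Everything here (toy shapes,
# alpha, the per-tensor round-to-nearest quantizer) is an illustrative assumption.
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix (n must be a power of 2)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so H @ H.T == I

def quantize_rtn(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-tensor round-to-nearest quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
d = 64                                     # toy hidden size
X = rng.normal(size=(128, d))              # calibration activations
X[:, :4] *= 50.0                           # inject "massive" channel outliers
W = rng.normal(size=(d, d)) / np.sqrt(d)   # weights of one linear layer

# 1) Channel-wise smoothing: migrate outlier magnitude from activations into weights.
alpha = 0.5                                # migration strength (assumed hyperparameter)
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

# 2) Hadamard rotation: spread the remaining outlier energy across channels.
H = hadamard(d)
X_t = (X / s) @ H                          # quantization-friendly activations
W_t = H.T @ (W * s[:, None])               # folded weights

err_plain = np.abs(quantize_rtn(X) @ quantize_rtn(W) - X @ W).mean()
err_fused = np.abs(quantize_rtn(X_t) @ quantize_rtn(W_t) - X @ W).mean()
print(f"mean abs W4A4 error, plain: {err_plain:.4f}  scaled+rotated: {err_fused:.4f}")
```

Because the scale and rotation are folded into the weights and the Hadamard matrix is orthonormal, X_t @ W_t equals X @ W exactly in full precision; only the quantization error changes.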
Related papers
- BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models [16.720321201956157]
BASE-Q is a simple yet powerful approach that combines bias correction and asymmetric scaling to reduce rounding and clipping errors. Experiments demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant, respectively.
arXiv Detail & Related papers (2025-05-26T14:22:21Z) - RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z) - OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting [20.944120156871108]
Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. We introduce the Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space.
arXiv Detail & Related papers (2025-01-23T08:24:25Z) - DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation [5.174900115018253]
We find substantial improvement in eliminating outliers for common tokens while achieving similar quantization error. Due to the extreme rarity of these tokens and their critical impact on model accuracy, we construct a simple yet effective method: a weighted loss function. Our method enhances rotated LLMs by making them both Outlier-Free and Massive Activation-Free, dubbed DFRot.
arXiv Detail & Related papers (2024-12-01T02:55:08Z) - FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z) - Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale.
Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation.
We propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Smooth and Rotation operations.
The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z) - SpinQuant: LLM quantization with learned rotations [49.07335692298487]
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs). We identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. We propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy.
arXiv Detail & Related papers (2024-05-26T02:15:49Z) - QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs [72.26197676852958]
We introduce QuaRot, a new Quantization scheme based on Rotations.
QuaRot quantizes end-to-end, including all weights, activations, and KV cache in 4 bits.
Our 4-bit quantized LLaMA2-70B model loses at most 0.47 WikiText-2 perplexity and retains 99% of the zero-shot performance (a toy sketch of the outlier-flattening effect of such rotations appears after this list).
arXiv Detail & Related papers (2024-03-30T19:20:06Z) - AffineQuant: Affine Transformation Quantization for Large Language Models [58.45460102764]
Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its compression efficiency and cost-effectiveness in the context of training.
Existing PTQ methods for Large-scale Language Models (LLMs) limit the optimization scope to scaling transformations between pre- and post-quantization weights.
In this paper, we advocate for direct optimization using equivalent Affine transformations in PTQ (AffineQuant).
arXiv Detail & Related papers (2024-03-19T08:40:21Z)
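The rotation-based methods above (QuaRot, SpinQuant, DFRot, Rotated Runtime Smooth) rely on the same flattening effect: an orthogonal Hadamard rotation preserves a vector's norm but spreads the energy of a massive activation channel across all channels, shrinking the dynamic range that a per-tensor quantizer must cover. The toy sketch below illustrates this with synthetic data; the metric and the numbers it prints are our own illustration, not results from any of the listed papers.

```python
# Toy illustration of outlier flattening: an orthonormal rotation keeps the norm but
# spreads one massive outlier over all channels, collapsing the max-to-RMS ratio that
# drives per-tensor quantization error. Data and metric are illustrative assumptions.
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester Hadamard matrix (n must be a power of 2)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 128
x = rng.normal(size=d)
x[0] = 500.0                      # one "massive activation" outlier channel

H = hadamard(d)
x_rot = H.T @ x                   # rotated activation; Euclidean norm is preserved

def dyn_range(v: np.ndarray) -> float:
    """Max magnitude relative to RMS: roughly what a per-tensor quantizer must cover."""
    return float(np.abs(v).max() / np.sqrt(np.mean(v ** 2)))

print(f"max/RMS before rotation: {dyn_range(x):.1f}")
print(f"max/RMS after  rotation: {dyn_range(x_rot):.1f}")
```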