Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
- URL: http://arxiv.org/abs/2409.20361v2
- Date: Mon, 11 Nov 2024 12:45:51 GMT
- Title: Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
- Authors: Ke Yi, Zengke Liu, Jianwei Zhang, Chengyuan Li, Tong Zhang, Junyang Lin, Jingren Zhou
- Abstract summary: Large language models incur substantial computation and memory movement costs due to their large scale.
Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation.
We propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and the Rotation operation.
The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
- Score: 54.2589824716527
- License:
- Abstract: Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale. Quantization methods have been employed to reduce service costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. From our observations of activations in large language models, outliers can be classified into channel-wise outliers and spike outliers. In this work, we propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and the Rotation operation. Runtime Smooth (RS) is introduced to eliminate channel-wise outliers by smoothing activations with their channel-wise maximums during runtime. The Rotation operation narrows the gap between spike outliers and normal values, alleviating the victim effect caused by channel-wise smoothing. The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
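For intuition, the following is a minimal NumPy sketch of the two ingredients described above: smoothing activations by their runtime channel-wise maximums and rotating with an orthogonal Hadamard matrix before low-bit quantization. The per-tensor INT4 quantizer, the composition order, and the folding of the smoothing factors into the weights are illustrative assumptions, not the authors' kernel-level implementation.
```python
import numpy as np

def fake_int4(x):
    """Symmetric per-tensor INT4 quantize-dequantize (levels -8..7); a toy stand-in."""
    scale = np.abs(x).max() / 7.0 + 1e-8
    return np.clip(np.round(x / scale), -8, 7) * scale

def runtime_smooth(a, w):
    """Divide each activation channel by its runtime maximum and fold the factor
    into the weight rows, so channel-wise outliers are flattened while a @ w is
    mathematically unchanged."""
    s = np.abs(a).max(axis=0) + 1e-8
    return a / s, w * s[:, None]

def hadamard(n):
    """Sylvester construction of an orthogonal n x n Hadamard matrix (n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)); x[:, 3] *= 50.0      # channel-wise outlier
w = rng.normal(size=(8, 8))
r = hadamard(8)                                   # rotation spreads spike outliers evenly

xs, ws = runtime_smooth(x @ r, r.T @ w)           # rotate, then smooth at runtime
err = np.abs(fake_int4(xs) @ fake_int4(ws) - x @ w).max()
print(err)                                        # INT4 error after smoothing + rotation
```
Quantizing x and w directly with the same per-tensor INT4 quantizer gives a noticeably larger error, because the outlier channel dominates the quantization scale and crushes the normal values.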
Related papers
- OutlierTune: Efficient Channel-Wise Quantization for Large Language Models [24.645237670811476]
OutlierTune is an efficient per-channel post-training quantization method for the activations of large language models.
The proposed framework is easy to implement and hardware-efficient, introducing almost no additional computational overhead during inference.
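As a generic illustration of per-channel activation quantization (not OutlierTune's specific algorithm), one scale per channel keeps an outlier channel from inflating every other channel's quantization step:
```python
import numpy as np

def quantize_per_channel(x, bits=8):
    """Symmetric quantization with one scale per channel (last axis); a generic
    sketch, not OutlierTune's scheme."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=0, keepdims=True) / qmax + 1e-8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q, scale                                # dequantize with q * scale

x = np.random.randn(16, 64); x[:, 0] *= 40.0       # one outlier channel
q, scale = quantize_per_channel(x)
print(np.abs(q * scale - x).max())                 # error scales with each channel's own range
```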
arXiv Detail & Related papers (2024-06-27T02:02:26Z)
- TernaryLLM: Ternarized Large Language Model [29.29122031050894]
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks.
We introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable.
We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization.
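A rough sketch of ternarization with a scale and a shift, the two quantities DLT makes learnable; the threshold rule and the exact parameterization below are assumptions, not the paper's formulation:
```python
import numpy as np

def ternarize(w, alpha, beta, threshold=0.05):
    """Map weights to {-1, 0, +1} and apply a scale (alpha) and shift (beta).
    In DLT both alpha and beta would be learned; here they are fixed numbers."""
    t = np.where(np.abs(w - beta) > threshold, np.sign(w - beta), 0.0)
    return alpha * t + beta

w = 0.1 * np.random.randn(4, 4)
print(ternarize(w, alpha=0.08, beta=0.01))
```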
arXiv Detail & Related papers (2024-06-11T11:40:12Z)
- DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs [40.48697728884967]
Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations.
Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes.
We introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate both massive and normal outliers.
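The effect of such orthogonal transformations can be seen in a few lines; a single dense random rotation and a channel permutation stand in here for DuQuant's learned block-wise construction, just to show that the product is preserved while the activation distribution the quantizer sees is changed:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16)); x[:, 5] *= 30.0      # outlier channel
w = rng.normal(size=(16, 16))

q, _ = np.linalg.qr(rng.normal(size=(16, 16)))     # random orthogonal rotation
p = np.eye(16)[rng.permutation(16)]                # channel permutation matrix
t = p @ q                                          # still orthogonal

xt, wt = x @ t.T, t @ w                            # transform activations and weights
print(np.abs(xt @ wt - x @ w).max())               # ~0: the product is preserved
print(np.abs(x).max(), np.abs(xt).max())           # peak magnitudes before and after
```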
arXiv Detail & Related papers (2024-06-03T18:27:44Z)
- SpinQuant: LLM quantization with learned rotations [49.07335692298487]
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs).
We identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy.
We propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy.
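The invariance being exploited is easy to check: an orthogonal rotation applied to the activations and counter-applied to the weights leaves the full-precision output identical while reshaping the distribution the quantizer sees. The QR-based rotation below is a stand-in for SpinQuant's learned rotations:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
w = rng.normal(size=(64, 64))
r, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # stand-in for a learned rotation

y_plain = x @ w
y_rotated = (x @ r) @ (r.T @ w)                  # rotate activations, counter-rotate weights
print(np.allclose(y_plain, y_rotated))           # True: full-precision outputs match
```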
arXiv Detail & Related papers (2024-05-26T02:15:49Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
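A heavily simplified sketch of input-adaptive skipping; the cosine-similarity criterion, threshold, and toy FFN below are illustrative assumptions, not FFN-SkipLLM's actual policy:
```python
import numpy as np

def ffn(h):
    """Toy stand-in for a transformer feed-forward block."""
    return h + 0.01 * np.tanh(h)

def maybe_skip_ffn(h, h_prev, threshold=0.999):
    """Skip the FFN when the hidden state barely changed since the previous
    layer (cosine similarity above a threshold); purely illustrative."""
    cos = h @ h_prev / (np.linalg.norm(h) * np.linalg.norm(h_prev) + 1e-8)
    return h if cos > threshold else ffn(h)

h_prev = np.random.randn(256)
h = h_prev + 1e-4 * np.random.randn(256)
out = maybe_skip_ffn(h, h_prev)   # almost certainly skipped: input barely changed
```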
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models [44.515165695546614]
Quantization-Aware Training (QAT) offers a solution, but its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for Large Language Models (LLMs).
We propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs.
arXiv Detail & Related papers (2023-10-12T05:25:49Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
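The classical baseline that WTA-CRS refines is column-row sampling, an unbiased randomized estimator of a matrix product; a minimal version (without the winner-take-all selection) is:
```python
import numpy as np

def crs_estimate(A, B, k, rng):
    """Unbiased column-row sampling estimate of A @ B using k sampled pairs.
    Column i of A / row i of B is drawn with probability proportional to
    ||A[:, i]|| * ||B[i, :]|| and reweighted so the expectation equals A @ B."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=k, p=p)
    return sum(np.outer(A[:, i], B[i, :]) / (k * p[i]) for i in idx)

rng = np.random.default_rng(0)
A, B = rng.normal(size=(32, 512)), rng.normal(size=(512, 32))
est = crs_estimate(A, B, k=128, rng=rng)
print(np.linalg.norm(est - A @ B) / np.linalg.norm(A @ B))  # relative error
```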
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Scaling the Convex Barrier with Sparse Dual Algorithms [141.4085318878354]
We present two novel dual algorithms for tight and efficient neural network bounding.
Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle.
We can obtain better bounds than off-the-shelf solvers in only a fraction of their running time.
arXiv Detail & Related papers (2021-01-14T19:45:17Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that adding momentum on top of them results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to premature decay of effective step sizes and suboptimal model performance.
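The mechanism is geometric: for a scale-invariant weight w (where rescaling w does not change the function), only the component of an update orthogonal to w matters, while momentum keeps inflating ||w|| and thereby shrinks the effective step. Removing the radial component, roughly in the spirit of AdamP's projection (this is not the full optimizer), looks like:
```python
import numpy as np

def project_out_radial(update, w):
    """Remove the component of the update parallel to w, so the norm of a
    scale-invariant weight stops growing and the effective step size is kept."""
    w_unit = w / (np.linalg.norm(w) + 1e-12)
    return update - np.dot(update, w_unit) * w_unit

w = np.random.randn(128)
g = np.random.randn(128)                 # momentum/Adam update for w
step = project_out_radial(g, w)
print(abs(np.dot(step, w)) < 1e-8)       # True: no radial growth from this step
```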
arXiv Detail & Related papers (2020-06-15T08:35:15Z)