Related papers: Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs

Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs

URL: http://arxiv.org/abs/2406.01721v1
Date: Mon, 3 Jun 2024 18:27:44 GMT
Title: Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs
Authors: Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, Ying Wei,
Abstract summary: Quantizing large language models (LLMs) presents significant challenges, primarily due to outlier activations. We propose DuQuant, an innovative quantization strategy employing rotation and permutation transformations to more effectively eliminate both types of outliers.
Score: 40.48697728884967
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Quantizing large language models (LLMs) presents significant challenges, primarily due to outlier activations that compromise the efficiency of low-bit representation. Traditional approaches mainly focus on solving Normal Outliers-activations with consistently high magnitudes across all tokens. However, these techniques falter when dealing with Massive Outliers, which are significantly higher in value and often cause substantial performance losses during low-bit quantization. In this study, we propose DuQuant, an innovative quantization strategy employing rotation and permutation transformations to more effectively eliminate both types of outliers. Initially, DuQuant constructs rotation matrices informed by specific outlier dimensions, redistributing these outliers across adjacent channels within different rotation blocks. Subsequently, a zigzag permutation is applied to ensure a balanced distribution of outliers among blocks, minimizing block-wise variance. An additional rotation further enhances the smoothness of the activation landscape, thereby improving model performance. DuQuant streamlines the quantization process and demonstrates superior outlier management, achieving top-tier results in multiple tasks with various LLM architectures even under 4-bit weight-activation quantization. Our code is available at https://github.com/Hsu1023/DuQuant.

Related papers

SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs [0.0]
We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs)<n>Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy.
arXiv Detail & Related papers (2025-06-04T19:07:45Z)
Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs [1.4999444543328293]
Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. Quantization is a widely used method to reduce memory usage and inference time, but LLMs present unique challenges due to the prevalence of outliers in their activations. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization.
arXiv Detail & Related papers (2025-04-18T13:46:58Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation [5.174900115018253]
We find substantial improvement in eliminating outliers for common tokens and achieve similar quantization error. Due to the extreme rarity of these tokens and their critical impact on model accuracy, we construct a simple yet effective method: a weighted loss function. Our method enhances the Rotated LLMs by achieving dual free, Outlier-Free and Massive Activation-Free, dubbed as DFRot.
arXiv Detail & Related papers (2024-12-01T02:55:08Z)
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models [58.5019443418822]
Diffusion models have been proven highly effective at generating high-quality images. As these models grow larger, they require significantly more memory and suffer from higher latency. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits.
arXiv Detail & Related papers (2024-11-07T18:59:58Z)
FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations.<n>Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective.<n>It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference [54.2589824716527]
Large language models incur substantial computation and memory movement costs due to their large scale. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. We propose Rotated Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Smooth and Rotation operation. The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.
arXiv Detail & Related papers (2024-09-30T14:59:22Z)
OutlierTune: Efficient Channel-Wise Quantization for Large Language Models [24.645237670811476]
OutlierTune is an efficient per-channel post-training quantization method for the activations of large language models. The proposed framework is easy to implement and hardware-efficient, introducing almost no additional computational overheads during the inference.
arXiv Detail & Related papers (2024-06-27T02:02:26Z)
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models [20.070306492164427]
Post-training quantization serves as a potent technique to accelerate the inference of large language models. Existing works still necessitate a considerable number of floating-point (FP) operations during inference. This limitation hinders the deployment of large language models on the edge and cloud devices. We propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for large language models.
arXiv Detail & Related papers (2024-05-28T05:56:11Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models [7.485068491216164]
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers. We propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel.
arXiv Detail & Related papers (2023-09-27T09:48:31Z)
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing [18.673619610942197]
Modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. We propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention.
arXiv Detail & Related papers (2023-06-22T14:39:04Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
MBQuant: A Novel Multi-Branch Topology Method for Arbitrary Bit-width Network Quantization [51.85834744835766]
We propose MBQuant, a novel method for arbitrary bit-width quantization. We show that MBQuant achieves significant performance gains compared to existing arbitrary bit-width quantization methods.
arXiv Detail & Related papers (2023-05-14T10:17:09Z)
Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping. This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
arXiv Detail & Related papers (2022-09-27T12:05:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.