Outlier Suppression: Pushing the Limit of Low-bit Transformer Language
Models
- URL: http://arxiv.org/abs/2209.13325v1
- Date: Tue, 27 Sep 2022 12:05:59 GMT
- Title: Outlier Suppression: Pushing the Limit of Low-bit Transformer Language
Models
- Authors: Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang
Zhang, Qi Zhang, Fengwei Yu, Xianglong Liu
- Abstract summary: Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
- Score: 57.933500846742234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer architecture has become the fundamental element of the widespread
natural language processing (NLP) models. With the trends of large NLP models,
the increasing memory and computation costs hinder their efficient deployment
on resource-limited devices. Therefore, transformer quantization attracts wide
research interest. Recent work recognizes that structured outliers are the
critical bottleneck for quantization performance. However, their proposed
methods increase the computation overhead and still leave the outliers there.
To fundamentally address this problem, this paper delves into the inherent
inducement and importance of the outliers. We discover that $\boldsymbol
\gamma$ in LayerNorm (LN) acts as a sinful amplifier for the outliers, and the
importance of outliers varies greatly where some outliers provided by a few
tokens cover a large area but can be clipped sharply without negative impacts.
Motivated by these findings, we propose an outlier suppression framework
including two components: Gamma Migration and Token-Wise Clipping. The Gamma
Migration migrates the outlier amplifier to subsequent modules in an equivalent
transformation, contributing to a more quantization-friendly model without any
extra burden. The Token-Wise Clipping takes advantage of the large variance of
token range and designs a token-wise coarse-to-fine pipeline, obtaining a
clipping range with minimal final quantization loss in an efficient way. This
framework effectively suppresses the outliers and can be used in a
plug-and-play mode. Extensive experiments prove that our framework surpasses
the existing works and, for the first time, pushes the 6-bit post-training BERT
quantization to the full-precision (FP) level. Our code is available at
https://github.com/wimh966/outlier_suppression.
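The two components lend themselves to a short sketch. The PyTorch code below is a minimal illustration under simplifying assumptions, not the authors' implementation (the linked repository has that): migrate_gamma folds LayerNorm's gamma into a single following nn.Linear as an equivalent transformation, assuming gamma has no zero entries and ignoring the residual branch that the paper also handles; token_wise_clipping shows a coarse-to-fine search driven by per-token statistics, where the quantile, the search grid, and the MSE objective are illustrative stand-ins for the paper's exact procedure.
```python
import torch
import torch.nn as nn

@torch.no_grad()
def migrate_gamma(ln: nn.LayerNorm, next_fc: nn.Linear) -> None:
    """Gamma Migration sketch: fold LN's gamma into the following linear layer.

    Uses LN(x) = gamma * x_hat + beta = gamma * (x_hat + beta / gamma), so the
    LN can output the un-amplified x_hat + beta / gamma while gamma is absorbed
    into the next layer's input channels (an equivalent transformation).
    Assumes elementwise affine LN and gamma with no zero entries.
    """
    gamma = ln.weight.clone()        # (hidden,)
    ln.bias.div_(gamma)              # beta' = beta / gamma
    ln.weight.fill_(1.0)             # LN becomes a "non-scaling" LN
    next_fc.weight.mul_(gamma)       # scale each input channel of W by gamma_j


def fake_quant(x: torch.Tensor, clip: float, n_bits: int = 6) -> torch.Tensor:
    """Symmetric uniform fake-quantization with a given clipping threshold."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


def token_wise_clipping(acts: torch.Tensor, alpha: float = 0.99,
                        n_bits: int = 6) -> float:
    """Token-Wise Clipping sketch on calibration activations of shape
    (num_tokens, hidden).

    Coarse stage: take the alpha-quantile of per-token maxima, which skips the
    few extreme-outlier tokens using only cheap token statistics.
    Fine stage: a small search around the coarse value, keeping the clipping
    threshold with minimal reconstruction error.
    """
    token_max = acts.abs().amax(dim=-1)              # one statistic per token
    coarse = torch.quantile(token_max, alpha).item()
    best_clip, best_err = coarse, float("inf")
    for ratio in torch.linspace(0.5, 1.2, steps=15):
        clip = coarse * ratio.item()
        err = (fake_quant(acts, clip, n_bits) - acts).pow(2).mean().item()
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```
In a 6-bit post-training setting, such a preprocessing step would run once over the model and a small calibration set before the usual per-tensor quantizers are fit.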
Related papers
- DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs [40.48697728884967]
Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations.
Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes.
We introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate both massive and normal outliers.
arXiv Detail & Related papers (2024-06-03T18:27:44Z)
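As a rough, hedged illustration of the rotation idea behind such methods (DuQuant's block-wise rotations and zigzag permutation are more involved), the sketch below shows that rotating the activations by an orthogonal matrix and folding the same matrix into the weights leaves a linear layer's output unchanged while spreading outlier mass across channels; the toy shapes and the random rotation are illustrative only.
```python
import torch

def random_orthogonal(dim: int) -> torch.Tensor:
    # QR decomposition of a random Gaussian matrix yields an orthogonal Q.
    q, _ = torch.linalg.qr(torch.randn(dim, dim))
    return q

hidden, out_dim = 8, 4
x = torch.randn(16, hidden)
x[:, 0] *= 50.0                      # simulate one structured outlier channel
w = torch.randn(out_dim, hidden)     # toy linear layer: y = x @ w.T

r = random_orthogonal(hidden)
x_rot = x @ r                        # rotated activations: outlier mass spread out
w_rot = w @ r                        # fold the inverse rotation into the weights

# r is orthogonal, so (x @ r) @ (w @ r).T = x @ r @ r.T @ w.T = x @ w.T
assert torch.allclose(x_rot @ w_rot.T, x @ w.T, atol=1e-3)
print(x.abs().max().item(), x_rot.abs().max().item())  # peak magnitude typically drops
```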
- Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks.
Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs.
We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies.
arXiv Detail & Related papers (2024-04-15T21:02:48Z)
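As a toy, hedged example of a deferral rule that consumes token-level uncertainty (the learned post-hoc rules studied in the paper are more sophisticated), the sketch below defers to the larger model when a low quantile of the small model's token log-probabilities falls under a threshold; the quantile, the threshold, and the commented generation API are assumptions.
```python
import torch

def should_defer(token_logprobs: torch.Tensor,
                 quantile: float = 0.2, threshold: float = -2.5) -> bool:
    """Deferral rule built on token-level uncertainty rather than a plain mean.

    token_logprobs: log-probabilities the small model assigned to the tokens it
    generated. A low quantile flags sequences whose worst tokens are uncertain,
    even when the average looks fine.
    """
    return torch.quantile(token_logprobs, quantile).item() < threshold

# cascade sketch (hypothetical API): run the small model first, defer only when needed
# small_out, logprobs = small_model.generate(prompt)
# result = large_model.generate(prompt) if should_defer(logprobs) else small_out
```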
- Accurate Block Quantization in LLMs with Outliers [0.6138671548064355]
The demand for inference on extremely large-scale LLMs has grown enormously in recent months.
The problem is aggravated by the rapid growth in the lengths of the sequences being processed.
Various quantization techniques have been proposed that allow accurate quantization for both weights and activations.
arXiv Detail & Related papers (2024-03-29T12:15:06Z)
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
- Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models [7.485068491216164]
Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks.
Weight-only quantization can be a promising approach, but sub-4 bit quantization remains a challenge due to large-magnitude activation outliers.
We propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel.
arXiv Detail & Related papers (2023-09-27T09:48:31Z)
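A hedged sketch of the per-IC grouping idea follows; the group size, bit width, and symmetric quantizer are illustrative choices, and the paper's calibration details are not reproduced.
```python
import torch

def quantize_group(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform quantization of one group with its own scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def per_ic_quantize(weight: torch.Tensor, group_size: int = 64,
                    n_bits: int = 4) -> torch.Tensor:
    """Quantize a (out_features, in_features) weight with groups formed within
    each input channel (along the output dimension of one column), so weights
    that multiply an outlier-heavy input channel only share a scale with each
    other instead of contaminating row-wise groups."""
    out_f, in_f = weight.shape
    wq = torch.empty_like(weight)
    for ic in range(in_f):                         # one column = one input channel
        col = weight[:, ic]
        for start in range(0, out_f, group_size):
            group = col[start:start + group_size]
            wq[start:start + group_size, ic] = quantize_group(group, n_bits)
    return wq
```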
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck of single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
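The Dense-and-Sparse decomposition can be sketched briefly; note this hedged version keeps only the largest-magnitude weights in the sparse part, whereas the paper also extracts sensitive values chosen with second-order information, and the 0.5% fraction is an illustrative default.
```python
import torch

def dense_and_sparse_split(weight: torch.Tensor, outlier_frac: float = 0.005):
    """Split a weight matrix into a sparse full-precision part (largest
    magnitudes) and a dense remainder to be quantized at low precision.

    Returns (dense, sparse) with dense + sparse.to_dense() == weight.
    """
    k = max(1, int(weight.numel() * outlier_frac))
    thresh = weight.abs().flatten().topk(k).values.min()
    mask = weight.abs() >= thresh
    sparse = (weight * mask).to_sparse()     # outliers kept in full precision
    dense = weight * (~mask)                 # remainder has a much tighter range
    return dense, sparse

# usage idea: quantize `dense` with a low-bit scheme; at inference add the
# sparse contribution back, e.g. y = x @ dense_q.T + x @ sparse.to_dense().T
```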
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling [44.60348333479704]
Post-training quantization of transformer language models faces challenges due to the existence of detrimental outliers in activations.
We propose the Outlier Suppression+(OS+) framework, which contains the channel-wise shifting for asymmetry and channel-wise scaling for concentration.
We show that these operations can be seamlessly migrated into subsequent modules while maintaining equivalence.
arXiv Detail & Related papers (2023-04-18T17:34:23Z)
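The migration step ("seamlessly migrated into subsequent modules while maintaining equivalence") can be sketched for a following nn.Linear; how the per-channel shift and scale values are chosen is the core of OS+ and is not shown here.
```python
import torch
import torch.nn as nn

@torch.no_grad()
def migrate_shift_and_scale(next_fc: nn.Linear,
                            shift: torch.Tensor, scale: torch.Tensor) -> None:
    """Given a channel-wise activation transform z = (x - shift) / scale,
    fold shift and scale into the next linear layer so that
    W x + b == (W * scale) z + (b + W @ shift) holds exactly."""
    if next_fc.bias is None:
        next_fc.bias = nn.Parameter(torch.zeros(
            next_fc.out_features, dtype=next_fc.weight.dtype,
            device=next_fc.weight.device))
    next_fc.bias.add_(next_fc.weight @ shift)   # uses the original weight
    next_fc.weight.mul_(scale)                  # then scale the input channels
```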
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
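A much-simplified, hedged sketch of Fourier-domain token mixing follows; the real AFNO uses block-diagonal weights and soft-shrinkage per frequency, and only the frequency-shared weights (which keep the mixer independent of sequence length) are retained here.
```python
import torch
import torch.nn as nn

class TinyFourierMixer(nn.Module):
    """Stripped-down Fourier token mixer in the spirit of AFNO.

    Tokens are moved to the frequency domain with an FFT along the sequence
    dimension, a small MLP (shared across frequencies) acts on each frequency's
    real/imaginary channel vector, and an inverse FFT returns to token space.
    Mixing costs O(N log N) in the sequence length N.
    """

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(2 * dim, hidden) * 0.02)
        self.w2 = nn.Parameter(torch.randn(hidden, 2 * dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        xf = torch.fft.rfft(x, dim=1)                  # (B, F, d), complex
        h = torch.view_as_real(xf).flatten(-2)         # (B, F, 2d), real
        h = torch.relu(h @ self.w1) @ self.w2          # per-frequency MLP, shared weights
        xf = torch.view_as_complex(h.view(b, -1, d, 2))
        return torch.fft.irfft(xf, n=n, dim=1)         # back to token space

# e.g. TinyFourierMixer(dim=64)(torch.randn(2, 4096, 64)) mixes all 4096 tokens
# without forming an attention matrix.
```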
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization [17.05322956052278]
Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks.
However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices.
We show that transformers have unique quantization challenges -- namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format.
arXiv Detail & Related papers (2021-09-27T10:57:18Z)