Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
- URL: http://arxiv.org/abs/2406.12016v2
- Date: Fri, 04 Oct 2024 06:26:20 GMT
- Title: Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
- Authors: Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee
- Abstract summary: We develop a simple yet effective strategy to facilitate per-tensor activation quantization by preventing the generation of problematic tokens.
We tune the token cache to regularize the activations of subsequent tokens to be more quantization-friendly.
We thoroughly evaluate our method over a wide range of models and benchmarks and find that it significantly surpasses the established baseline of per-tensor W8A8 quantization.
- Score: 13.475050661770796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in LLM quantization, activation quantization remains challenging due to activation outliers. Conventional remedies, e.g., mixing precisions for different channels, introduce extra overhead and reduce the speedup. In this work, we develop a simple yet effective strategy to facilitate per-tensor activation quantization by preventing the generation of problematic tokens. Precisely, we propose a method to find a set of key-value cache entries, coined CushionCache, which mitigates outliers in subsequent tokens when inserted as a prefix. CushionCache works in two steps: First, we greedily search for a prompt token sequence that minimizes the maximum activation values in subsequent tokens. Then, we further tune the token cache to regularize the activations of subsequent tokens to be more quantization-friendly. The proposed method successfully addresses activation outliers of LLMs, providing a substantial performance boost for per-tensor activation quantization methods. We thoroughly evaluate our method over a wide range of models and benchmarks and find that it significantly surpasses the established baseline of per-tensor W8A8 quantization and can be seamlessly integrated with the recent activation quantization method.
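The abstract's first step (greedy prefix search) can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: it assumes a HuggingFace-style causal LM interface, uses the maximum absolute hidden activation as a stand-in for the paper's outlier measure, and the names `greedy_prefix_search`, `max_activation`, `calib_ids`, `candidate_ids`, and `prefix_len` are hypothetical. The second step (tuning the prefix's cached keys and values) is only noted in a comment.

```python
# Minimal sketch of CushionCache step 1: greedily pick prefix tokens that keep
# the activations of subsequent (calibration) tokens small. Hypothetical names;
# assumes a HuggingFace-style causal LM that supports output_hidden_states.
import torch

@torch.no_grad()
def max_activation(model, input_ids):
    """Largest absolute hidden activation over all layers and positions."""
    out = model(input_ids, output_hidden_states=True)
    return max(h.abs().max().item() for h in out.hidden_states)

@torch.no_grad()
def greedy_prefix_search(model, calib_ids, candidate_ids, prefix_len=4):
    """Build a prefix token-by-token, each time choosing the candidate token
    that minimizes the outlier proxy on the calibration sequence."""
    prefix = []
    for _ in range(prefix_len):
        best_tok, best_score = None, float("inf")
        for tok in candidate_ids:  # a small candidate subset of the vocabulary
            trial = torch.tensor([prefix + [tok] + calib_ids], device=model.device)
            score = max_activation(model, trial)
            if score < best_score:
                best_tok, best_score = tok, score
        prefix.append(best_tok)
    # Step 2 of the paper then tunes the key-value cache of this prefix so that
    # subsequent activations become more quantization-friendly (not shown here).
    return prefix
```

In practice such a search would presumably be run over several calibration sequences, with the per-tensor quantizer calibrated after the prefix is in place.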
Related papers
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraints on every token can be prohibitively expensive.
Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models.
We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
arXiv Detail & Related papers (2024-10-15T05:57:51Z) - PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization [44.547992997369875]
We propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels.
First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache.
Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error.
arXiv Detail & Related papers (2024-10-07T17:59:35Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs [5.408684636210501]
Post-training quantization (PTQ) has become a popular approach, quantizing weights and activations to lower precision.
We show the challenges of activation quantization in GLU variants, which are widely used in the feed-forward networks (FFN) of modern large language models.
We propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to isolate the activation spikes during quantization.
arXiv Detail & Related papers (2024-05-23T10:54:14Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - An Efficient Rehearsal Scheme for Catastrophic Forgetting Mitigation during Multi-stage Fine-tuning [55.467047686093025]
A common approach to alleviate such forgetting is to rehearse samples from prior tasks during fine-tuning.
We propose a sampling scheme, mix-cd, that prioritizes rehearsal of "collateral damage" samples.
Our approach is computationally efficient, easy to implement, and outperforms several leading continual learning methods in compute-constrained settings.
arXiv Detail & Related papers (2024-02-12T22:32:12Z) - QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we empirically unravel three properties in quantized diffusion models that compromise the efficacy of current methods.
We identify two critical types of quantized layers: those holding vital temporal information and those sensitive to reduced bit-width.
Our method is evaluated over three high-resolution image generation tasks and achieves state-of-the-art performance under various bit-width settings.
arXiv Detail & Related papers (2024-02-06T03:39:44Z) - Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge [45.690907522226794]
Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks.
Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance.
We propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models.
arXiv Detail & Related papers (2023-12-09T22:12:52Z) - QuantEase: Optimization-based Quantization for Language Models [17.333778751252392]
This work introduces Post-Training Quantization (PTQ) of the various layers of Large Language Models (LLMs), building on recent advances.
Our coordinate-descent (CD)-based approach features straightforward updates, relying solely on vector operations.
We also explore an outlier-aware approach, allowing for retaining significant weights (outliers) with full precision.
arXiv Detail & Related papers (2023-09-05T01:39:09Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z) - Accelerating BERT Inference for Sequence Labeling via Early-Exit [65.7292767360083]
We extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks.
We also propose a token-level early-exit mechanism that allows a subset of tokens to exit early at different layers.
Our approach can save up to 66%-75% inference cost with minimal performance degradation.
arXiv Detail & Related papers (2021-05-28T14:39:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.