Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
- URL: http://arxiv.org/abs/2406.12016v2
- Date: Fri, 04 Oct 2024 06:26:20 GMT
- Title: Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
- Authors: Seungwoo Son, Wonpyo Park, Woohyun Han, Kyuyeun Kim, Jaeho Lee
- Abstract summary: We develop a simple yet effective strategy to facilitate per-tensor activation quantization by preventing the generation of problematic tokens.
We tune the token cache to regularize the activations of subsequent tokens to be more quantization-friendly.
We thoroughly evaluate our method over a wide range of models and benchmarks and find that it significantly surpasses the established baseline of per-tensor W8A8 quantization.
- Score: 13.475050661770796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in LLM quantization, activation quantization remains challenging due to activation outliers. Conventional remedies, e.g., mixing precisions for different channels, introduce extra overhead and reduce the speedup. In this work, we develop a simple yet effective strategy to facilitate per-tensor activation quantization by preventing the generation of problematic tokens. Precisely, we propose a method to find a set of key-value cache entries, coined CushionCache, which mitigates outliers in subsequent tokens when inserted as a prefix. CushionCache works in two steps: First, we greedily search for a prompt token sequence that minimizes the maximum activation values in subsequent tokens. Then, we further tune the token cache to regularize the activations of subsequent tokens to be more quantization-friendly. The proposed method successfully addresses activation outliers of LLMs, providing a substantial performance boost for per-tensor activation quantization methods. We thoroughly evaluate our method over a wide range of models and benchmarks and find that it significantly surpasses the established baseline of per-tensor W8A8 quantization and can be seamlessly integrated with the recent activation quantization method.
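The abstract's first step (greedy prefix search) can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: it assumes a HuggingFace-style causal LM interface, uses the maximum absolute hidden activation as a stand-in for the paper's outlier measure, and the names `greedy_prefix_search`, `max_activation`, `calib_ids`, `candidate_ids`, and `prefix_len` are hypothetical. The second step (tuning the prefix's cached keys and values) is only noted in a comment.

```python
# Minimal sketch of CushionCache step 1: greedily pick prefix tokens that keep
# the activations of subsequent (calibration) tokens small. Hypothetical names;
# assumes a HuggingFace-style causal LM that supports output_hidden_states.
import torch

@torch.no_grad()
def max_activation(model, input_ids):
    """Largest absolute hidden activation over all layers and positions."""
    out = model(input_ids, output_hidden_states=True)
    return max(h.abs().max().item() for h in out.hidden_states)

@torch.no_grad()
def greedy_prefix_search(model, calib_ids, candidate_ids, prefix_len=4):
    """Build a prefix token-by-token, each time choosing the candidate token
    that minimizes the outlier proxy on the calibration sequence."""
    prefix = []
    for _ in range(prefix_len):
        best_tok, best_score = None, float("inf")
        for tok in candidate_ids:  # a small candidate subset of the vocabulary
            trial = torch.tensor([prefix + [tok] + calib_ids], device=model.device)
            score = max_activation(model, trial)
            if score < best_score:
                best_tok, best_score = tok, score
        prefix.append(best_tok)
    # Step 2 of the paper then tunes the key-value cache of this prefix so that
    # subsequent activations become more quantization-friendly (not shown here).
    return prefix
```

In practice such a search would presumably be run over several calibration sequences, with the per-tensor quantizer calibrated after the prefix is in place.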
Related papers
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraints on every token can be prohibitively expensive.
Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - QSpec: Speculative Decoding with Complementary Quantization Schemes [37.007621357142725]
Quantization has been substantially adopted to accelerate inference and reduce memory consumption of large language models.
We propose a novel quantization paradigm called QSPEC, which seamlessly integrates two complementary quantization schemes for speculative decoding.
QSPEC empirically boosts token generation throughput by up to 1.80x without any quality compromise.
arXiv Detail & Related papers (2024-10-15T05:57:51Z) - PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization [44.547992997369875]
We propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels.
First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache.
Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error.
arXiv Detail & Related papers (2024-10-07T17:59:35Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs [5.408684636210501]
Post-training quantization (PTQ) has become a popular approach, quantizing weights and activations to lower precision.
We show the challenges of activation quantization in GLU variants, which are widely used in the feed-forward networks (FFN) of modern large language models.
We propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to isolate the activation spikes during quantization.
arXiv Detail & Related papers (2024-05-23T10:54:14Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language
Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - An Efficient Rehearsal Scheme for Catastrophic Forgetting Mitigation during Multi-stage Fine-tuning [55.467047686093025]
A common approach to alleviate such forgetting is to rehearse samples from prior tasks during fine-tuning.
We propose a sampling scheme, mix-cd, that prioritizes rehearsal of "collateral damage" samples.
Our approach is computationally efficient, easy to implement, and outperforms several leading continual learning methods in compute-constrained settings.
arXiv Detail & Related papers (2024-02-12T22:32:12Z) - QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning [52.157939524815866]
In this paper, we empirically unravel three properties in quantized diffusion models that compromise the efficacy of current methods.
We identify two critical types of quantized layers: those holding vital temporal information and those sensitive to reduced bit-width.
Our method is evaluated over three high-resolution image generation tasks and achieves state-of-the-art performance under various bit-width settings.
arXiv Detail & Related papers (2024-02-06T03:39:44Z) - Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge [45.690907522226794]
Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks.
Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance.
We propose Agile-Quant, an activation-guided quantization framework for popular Large Language Models.
arXiv Detail & Related papers (2023-12-09T22:12:52Z) - QuantEase: Optimization-based Quantization for Language Models [17.333778751252392]
This work introduces Post-Training Quantization (PTQ) of the various layers of Large Language Models (LLMs), building on recent advances.
Our coordinate-descent (CD)-based approach features straightforward updates, relying solely on vector operations.
We also explore an outlier-aware approach, allowing for retaining significant weights (outliers) with full precision.
arXiv Detail & Related papers (2023-09-05T01:39:09Z) - PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models [52.09865918265002]
We propose a novel "quantize before fine-tuning" framework, PreQuant.
PreQuant is compatible with various quantization strategies, with outlier-aware fine-tuning incorporated to correct the induced quantization error.
We demonstrate the effectiveness of PreQuant on the GLUE benchmark using BERT, RoBERTa, and T5.
arXiv Detail & Related papers (2023-05-30T08:41:33Z) - Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss [61.26793005355441]
Cluster-Promoting Quantization (CPQ) finds the optimal quantization grids for neural networks.
DropBits is a new bit-drop technique that revises the standard dropout regularization to randomly drop bits instead of neurons.
We experimentally validate our method on various benchmark datasets and network architectures.
arXiv Detail & Related papers (2021-09-05T15:15:07Z) - Accelerating BERT Inference for Sequence Labeling via Early-Exit [65.7292767360083]
We extend the recent successful early-exit mechanism to accelerate the inference of PTMs for sequence labeling tasks.
We also propose a token-level early-exit mechanism that allows a subset of tokens to exit early at different layers.
Our approach can save up to 66%-75% inference cost with minimal performance degradation.
arXiv Detail & Related papers (2021-05-28T14:39:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.