A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder
- URL: http://arxiv.org/abs/2407.20485v2
- Date: Wed, 31 Jul 2024 02:02:40 GMT
- Title: A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder
- Authors: Hyun-rae Jo, Dongkun Shin
- Abstract summary: We propose Accumulative Attention Score with Forgetting Factor (A2SF) technique, which introduces a Forgetting Factor in the Attention Score accumulation process.
A2SF applies a penalty to the past Attention Score generated from old tokens by repeatedly multiplying the Forgetting Factor to the Attention Score over time.
We have verified the accuracy improvement of A2SF in the OPT and LLaMA models; A2SF improves the accuracy of LLaMA 2 by up to 7.8% and 5.1% in the 1-shot and 0-shot settings, respectively.
- Score: 1.6114012813668932
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, large language models (LLMs) based on transformers have been facing memory bottlenecks due to the KV cache, especially when handling long sequences. Previous research proposed KV cache compression techniques that identify insignificant tokens based on Accumulative Attention Scores and remove their items from the KV cache, noting that only a few tokens play an important role in attention operations. However, we have observed that the existing Accumulative Attention Score is not suitable for the transformer decoder structure. In the decoder model, the number of times the Attention Score accumulates varies depending on the order of token appearance due to the effect of masking, causing an uneven comparison between tokens. To solve this, we propose the Accumulative Attention Score with Forgetting Factor (A2SF) technique, which introduces a Forgetting Factor into the Attention Score accumulation process. A2SF penalizes past Attention Scores generated from old tokens by repeatedly multiplying them by the Forgetting Factor over time. Therefore, older tokens receive a larger penalty, providing fairness among tokens of different ages. Through this fair comparison among tokens, we can more effectively select important tokens. We have verified the accuracy improvement of A2SF in the OPT and LLaMA models; A2SF improves the accuracy of LLaMA 2 by up to 7.8% and 5.1% in the 1-shot and 0-shot settings, respectively.
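The accumulation rule described in the abstract can be written as score_i ← λ · score_i + a_{t,i} at every decoding step t, where λ is the Forgetting Factor and a_{t,i} is the attention the newly generated token pays to token i; a contribution added k steps ago is therefore scaled by λ^k. Below is a minimal NumPy sketch of this bookkeeping. The function names, the default λ = 0.9, and the cache_budget parameter are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def a2sf_scores(attn_rows, forgetting_factor=0.9):
    """Accumulate attention scores with a forgetting factor (A2SF-style sketch).

    attn_rows: list of 1-D arrays; attn_rows[t] holds the attention weights that
    the token generated at step t assigns to the tokens visible to it under the
    causal mask (length t + 1), i.e. one row of the masked attention matrix per
    decoding step. Returns each token's accumulated score after the last step.
    """
    num_tokens = len(attn_rows)
    scores = np.zeros(num_tokens)
    for t, row in enumerate(attn_rows):
        # Penalize previously accumulated scores: a contribution added k steps
        # ago has effectively been multiplied by forgetting_factor**k.
        scores[: t + 1] *= forgetting_factor
        # Add the attention this step pays to each visible (non-masked) token.
        scores[: t + 1] += row
    return scores

def select_tokens_to_keep(scores, cache_budget):
    """Keep the indices of the cache_budget highest-scoring tokens (hypothetical eviction rule)."""
    keep = np.argsort(scores)[-cache_budget:]
    return np.sort(keep)
```

In a KV cache compression setting, a sketch like this would be run per head as decoding proceeds, with the lowest-scoring tokens evicted from the cache whenever it exceeds the budget; the decay makes early tokens, which have had more accumulation steps, comparable to recently generated ones.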
Related papers
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify tokens that need to be attended as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z)
- Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive or better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
- Efficient Video Action Detection with Token Dropout and Context Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection based on vision transformers (ViTs).
In a video clip, we maintain tokens from its keyframe while preserving tokens relevant to actor motions from other frames.
Second, we refine scene context by leveraging remaining tokens for better recognizing actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z)
- Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions.
We propose two general techniques to make attention more stable.
First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
arXiv Detail & Related papers (2023-03-20T14:04:40Z)
- Input-length-shortening and text generation via attention values [1.8222946691865871]
We show that the first layer's attention sums can be used to filter tokens in a given sequence.
We also show that retaining approximately 6% of the original sequence is sufficient to obtain 86.5% accuracy.
arXiv Detail & Related papers (2023-03-14T02:11:24Z)
- Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers [32.972945618608726]
Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency.
We propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.
Our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
arXiv Detail & Related papers (2022-11-21T09:57:11Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT [22.904252855587348]
We propose a fine- and coarse-granularity hybrid self-attention (FCA) that reduces the cost by progressively shortening the computational sequence length in self-attention.
We show that FCA offers a significantly better trade-off between accuracy and FLOPs compared to prior methods.
arXiv Detail & Related papers (2022-03-17T03:33:47Z)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT, a ViT with token Pooling and attention Sharing, to reduce redundancy.
Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.