Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
- URL: http://arxiv.org/abs/2406.01283v1
- Date: Mon, 3 Jun 2024 12:51:52 GMT
- Title: Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification
- Authors: Jungmin Yun, Mihyeon Kim, Youngbin Kim
- Abstract summary: Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
- Score: 6.660834045805309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts with all tokens, including those unfavorable to classification performance. To overcome these challenges, we propose integrating two strategies: token pruning and token combining. Token pruning eliminates less important tokens from the attention mechanism's keys and values as they pass through the layers. Additionally, we adopt fuzzy logic to handle uncertainty and alleviate the mispruning risks arising from an imbalanced distribution of token importance. Token combining, in turn, condenses input sequences into smaller sizes to compress the model further. Integrating these two approaches both improves the model's performance and reduces its computational demands. Experiments on various datasets demonstrate superior performance compared to baseline models, with the largest improvement over the existing BERT model reaching +5%p in accuracy and +5.6%p in F1 score; memory cost is reduced to 0.61x and a 1.64x speedup is achieved.
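As a concrete illustration of the two strategies, here is a minimal PyTorch sketch, assuming attention-received scores as token importance and uniform average pooling as the combiner; the paper's fuzzy-logic reweighting and its learned combining module are not reproduced, and all shapes and ratios are illustrative.

```python
import torch
import torch.nn.functional as F

def prune_keys_values(q, k, v, keep_ratio=0.5):
    """Token pruning: score each key by the attention it receives and keep
    only the top keys/values; queries are left intact. The paper's fuzzy-logic
    reweighting of the importance distribution is omitted here.
    q, k, v: (batch, heads, seq, dim)."""
    attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    importance = attn.mean(dim=(1, 2))                   # (batch, seq)
    idx = importance.topk(max(1, int(k.shape[2] * keep_ratio)), dim=-1).indices
    idx = idx[:, None, :, None].expand(-1, k.shape[1], -1, k.shape[-1])
    return k.gather(2, idx), v.gather(2, idx)

def combine_tokens(x, n_out=16):
    """Token combining: condense the sequence into n_out tokens by pooling
    groups of adjacent tokens (the paper learns this combination; uniform
    average pooling is an illustrative stand-in). x: (batch, seq, dim)."""
    b, n, d = x.shape
    x = F.pad(x, (0, 0, 0, (-n) % n_out))                # pad seq to a multiple
    return x.reshape(b, n_out, -1, d).mean(dim=2)        # (batch, n_out, dim)

# Example: prune half of the 128 keys/values, then compress 128 tokens to 16.
q = k = v = torch.randn(2, 8, 128, 64)
k_small, v_small = prune_keys_values(q, k, v)            # (2, 8, 64, 64)
x_small = combine_tokens(torch.randn(2, 128, 768))       # (2, 16, 768)
```

Pruning shrinks the key/value side of the attention matrix as depth grows, while combining shortens the sequence itself; these are the two sources of the reported memory and speed savings.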
Related papers
- MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose MC-MoE, a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs).
MC-MoE leverages the significance of both experts and tokens to achieve extreme compression.
For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
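A rough sketch of the bit-allocation flavor of this idea, using a greedy allocator and made-up expert importance scores; MC-MoE's actual optimization and its token-level component are not reproduced here.

```python
import numpy as np

# Hypothetical expert importance scores (e.g., routing frequency combined
# with activation statistics); the numbers are made up for illustration.
importance = np.array([0.31, 0.05, 0.22, 0.02, 0.18, 0.08, 0.10, 0.04])
bit_choices = (1, 2, 3)        # allowed per-expert weight bit-widths
budget_avg_bits = 2.0          # target average bit-width across experts

bits = np.full(importance.shape, bit_choices[0])
# Greedily grant one extra bit to the most important upgradable expert
# until the average-bit budget is met.
while bits.mean() < budget_avg_bits:
    upgradable = bits < bit_choices[-1]
    if not upgradable.any():
        break
    bits[np.where(upgradable, importance, -np.inf).argmax()] += 1

print(bits)   # important experts end up with more bits, e.g. [3 1 3 1 3 1 3 1]
```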
- Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning [63.43972993473501]
Token compression expedites the training and inference of Vision Transformers (ViTs).
However, when applied to downstream tasks, compression degrees are mismatched between training and inference stages.
We propose a model arithmetic framework to decouple the compression degrees between the two stages.
arXiv Detail & Related papers (2024-08-13T10:36:43Z)
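A loose sketch of what model arithmetic between compression degrees could look like, assuming a hypothetical per-unit-gap plugin `unit_delta` that is scaled to bridge the train/inference mismatch; this is an interpretation, not the paper's actual procedure.

```python
import torch

def adapt_to_degree(theta, unit_delta, train_degree, infer_degree):
    """Scale a per-unit-gap plugin to bridge the mismatch between the
    compression degree used in training and the one used at inference."""
    gap = infer_degree - train_degree
    return {name: w + gap * unit_delta[name] for name, w in theta.items()}

# Hypothetical parameters: a fine-tuned weight and a small compensator delta.
theta = {"blocks.0.attn.qkv.weight": torch.randn(2304, 768)}
unit_delta = {"blocks.0.attn.qkv.weight": 0.01 * torch.randn(2304, 768)}
# Trained without token compression (degree 0), deployed at degree 8.
theta_deploy = adapt_to_degree(theta, unit_delta, train_degree=0, infer_degree=8)
```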
- Token Fusion: Bridging the Gap between Token Pruning and Token Merging [71.84591084401458]
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
arXiv Detail & Related papers (2023-12-02T04:29:19Z)
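One plausible reading of combining pruning with merging, as a sketch: keep the most important tokens, discard the clearly redundant tail, and merge the middle band into the most similar kept tokens. The importance scores and band ratios are assumptions.

```python
import torch
import torch.nn.functional as F

def token_fusion(x, importance, keep=0.5, prune=0.25):
    """Keep the top tokens, prune the least important tail outright, and
    merge the middle band into the most similar kept tokens.
    x: (n, d); importance: (n,); ratios are illustrative."""
    n = x.shape[0]
    order = importance.argsort(descending=True)
    kept = order[: int(n * keep)]
    merged = order[int(n * keep): int(n * (1 - prune))]   # merged, not dropped
    out = x[kept].clone()                                  # tail is pruned
    if len(merged) > 0:
        sim = F.normalize(x[merged], dim=-1) @ F.normalize(out, dim=-1).T
        tgt = sim.argmax(dim=-1)                           # nearest kept token
        out.index_add_(0, tgt, x[merged])
        counts = torch.ones(len(kept)).index_add_(0, tgt, torch.ones(len(merged)))
        out = out / counts[:, None]
    return out

pooled = token_fusion(torch.randn(196, 384), torch.rand(196))   # (98, 384)
```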
- Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions [101.08706223326928]
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers.
In this paper, we seek to enable $\mathcal{O}(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
arXiv Detail & Related papers (2023-10-28T18:40:03Z)
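The underlying trick can be demonstrated directly: a diagonal state-space recurrence whose impulse response equals a convolution filter computes the same outputs with $\mathcal{O}(1)$ state updates per token. A self-contained sketch with made-up poles and a made-up filter (the paper's distillation procedure for fitting the recurrence is not shown):

```python
import torch

# A diagonal state-space model whose impulse response equals a convolution
# filter; the recurrence reproduces the convolution with O(1) work per token.
d_state, T = 8, 64
A = 0.9 * torch.exp(1j * torch.linspace(0.1, 2.0, d_state))   # stable poles
B = torch.randn(d_state, dtype=torch.cfloat)
C = torch.randn(d_state, dtype=torch.cfloat)

# Filter taps h_k = Re(sum_i C_i * A_i^k * B_i), i.e. the impulse response.
h = torch.stack([(C * A**k * B).sum().real for k in range(T)])
x = torch.randn(T)

# Reference: causal convolution y_t = sum_k h_k * x_{t-k}.
y_conv = torch.stack([(h[: t + 1].flip(0) * x[: t + 1]).sum() for t in range(T)])

# Recurrent evaluation: s_t = A*s_{t-1} + B*x_t, y_t = Re(C^T s_t).
s = torch.zeros(d_state, dtype=torch.cfloat)
y_rec = []
for t in range(T):
    s = A * s + B * x[t]
    y_rec.append((C * s).sum().real)

assert torch.allclose(y_conv, torch.stack(y_rec), atol=1e-3)
```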
- Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational costs while only incurring a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z)
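A sketch of the merging step as described, assuming [CLS]-attention as the importance score: each non-crucial token is folded into its most similar crucial token with importance weighting, so its content is retained rather than discarded. The multi-scale component is omitted.

```python
import torch
import torch.nn.functional as F

def merge_noncrucial(x, importance, keep_ratio=0.5):
    """Fold each non-crucial token into its most similar crucial token with
    importance weighting. x: (n, d); importance: (n,), assumed positive,
    e.g. attention received from [CLS]."""
    n_keep = max(1, int(x.shape[0] * keep_ratio))
    order = importance.argsort(descending=True)
    crucial, rest = order[:n_keep], order[n_keep:]
    sim = F.normalize(x[rest], dim=-1) @ F.normalize(x[crucial], dim=-1).T
    tgt = sim.argmax(dim=-1)                       # destination crucial token
    num = x[crucial] * importance[crucial][:, None]
    den = importance[crucial].clone()
    num.index_add_(0, tgt, x[rest] * importance[rest][:, None])
    den.index_add_(0, tgt, importance[rest])
    return num / den[:, None]                      # importance-weighted average
```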
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
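A simplified sketch of convolution-guided token selection, assuming a depthwise convolution as the scorer; CageViT's actual scoring, token rearrangement, and modified encoder are not reproduced.

```python
import torch
import torch.nn as nn

class ConvGuidedSelect(nn.Module):
    """Score patch tokens with a lightweight depthwise convolution over the
    patch grid and keep the most activated ones."""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.keep_ratio = keep_ratio

    def forward(self, x, h, w):                      # x: (batch, h*w, dim)
        b, n, d = x.shape
        fmap = x.transpose(1, 2).reshape(b, d, h, w)
        score = self.conv(fmap).abs().mean(dim=1).reshape(b, n)
        idx = score.topk(max(1, int(n * self.keep_ratio)), dim=-1).indices
        return x.gather(1, idx[:, :, None].expand(-1, -1, d))

tokens = ConvGuidedSelect(dim=192)(torch.randn(2, 14 * 14, 192), h=14, w=14)
```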
- Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers [32.972945618608726]
Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency.
We propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.
Our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.
arXiv Detail & Related papers (2022-11-21T09:57:11Z)
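A sketch of jointly using importance and diversity, under assumed scores and thresholds: attentive tokens are kept untouched, while near-duplicate inattentive tokens are merged into one token each, so diverse low-attention content still survives.

```python
import torch
import torch.nn.functional as F

def decouple_and_merge(x, cls_attn, n_keep=64, sim_thresh=0.9):
    """Keep attentive tokens as-is; among the inattentive ones, merge
    near-duplicates (cosine similarity above sim_thresh) into single tokens.
    x: (n, d); cls_attn: (n,) attention received from the class token."""
    order = cls_attn.argsort(descending=True)
    attentive, rest = order[:n_keep], order[n_keep:]
    out, pool = [x[attentive]], x[rest]
    used = torch.zeros(len(rest), dtype=torch.bool)
    for i in range(len(rest)):
        if used[i]:
            continue
        group = (F.cosine_similarity(pool[i:i + 1], pool, dim=-1) > sim_thresh) & ~used
        used |= group
        out.append(pool[group].mean(dim=0, keepdim=True))  # one token per group
    return torch.cat(out, dim=0)
```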
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
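A sketch of what a learnable pruning threshold can look like, assuming [CLS]-attention token scores; the paper's exact score construction and training objective are not reproduced.

```python
import torch
import torch.nn as nn

class LearnableThresholdPruning(nn.Module):
    """A per-layer learnable threshold on token scores: a sigmoid around the
    threshold gives a differentiable soft mask in training, and tokens below
    it are pruned outright at inference."""
    def __init__(self, init_threshold=0.01, temperature=5e-3):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, x, score):                 # x: (b, n, d); score: (b, n)
        mask = torch.sigmoid((score - self.threshold) / self.temperature)
        if self.training:
            return x * mask[..., None]           # gradients reach the threshold
        return [xi[mi > 0.5] for xi, mi in zip(x, mask)]  # ragged at inference
```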
- Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z)
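An illustrative forward pass for token dropping, assuming an externally supplied per-token importance (the paper derives it from the already-computed MLM loss at essentially no extra cost): unimportant tokens skip the middle layers and are restored at their original positions before the final layers.

```python
import torch
import torch.nn as nn

def token_dropping_forward(x, importance, mid_layers, final_layers, keep_ratio=0.5):
    """Run only the important tokens through the middle layers, then re-insert
    the dropped ones. x: (b, n, d); importance: (b, n)."""
    b, n, d = x.shape
    idx = importance.topk(max(1, int(n * keep_ratio)), dim=-1).indices
    idx = idx.sort(dim=-1).values                         # preserve token order
    gidx = idx[..., None].expand(-1, -1, d)
    kept = x.gather(1, gidx)
    for layer in mid_layers:                              # full cost on kept tokens only
        kept = layer(kept)
    x = x.scatter(1, gidx, kept)                          # re-insert processed tokens
    for layer in final_layers:
        x = layer(x)
    return x

# Toy usage with identity layers standing in for transformer blocks.
out = token_dropping_forward(torch.randn(2, 128, 768), torch.rand(2, 128),
                             mid_layers=[nn.Identity()], final_layers=[nn.Identity()])
```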
- Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT [22.904252855587348]
We propose a fine- and coarse-granularity hybrid self-attention (FCA) that reduces the cost by progressively shortening the computational sequence length in self-attention.
We show that FCA offers a significantly better trade-off between accuracy and FLOPs compared to prior methods.
arXiv Detail & Related papers (2022-03-17T03:33:47Z)
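A sketch of the fine/coarse split with assumed token budgets: informative tokens stay at fine granularity, the remainder is pooled into a handful of coarse summary tokens, and subsequent self-attention runs on the shortened sequence.

```python
import torch

def fine_coarse_split(x, importance, n_fine=128, n_coarse=16):
    """Keep the n_fine most informative tokens as-is and pool the rest into
    n_coarse summary tokens, so later attention layers see n_fine + n_coarse
    tokens instead of n. x: (n, d); assumes n > n_fine."""
    order = importance.argsort(descending=True)
    fine, rest = x[order[:n_fine]], x[order[n_fine:]]
    pad = (-len(rest)) % n_coarse                          # zero-pad to even groups
    rest = torch.cat([rest, rest.new_zeros(pad, x.shape[1])])
    coarse = rest.reshape(n_coarse, -1, x.shape[1]).mean(dim=1)
    return torch.cat([fine, coarse], dim=0)                # shortened sequence

shorter = fine_coarse_split(torch.randn(512, 768), torch.rand(512))   # (144, 768)
```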
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose the Adaptive Fourier Neural Operator (AFNO), an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
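A simplified Fourier token mixer in the spirit of AFNO, assuming a single complex channel-mixing weight shared across frequencies plus soft-shrinkage; AFNO's block-diagonal MLP structure is omitted. FFT-based mixing costs O(N log N) in sequence length, which is what makes 65k-token sequences tractable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourierMixer(nn.Module):
    """Mix tokens in the Fourier domain: FFT over the sequence, a complex
    channel-mixing weight shared across frequencies, soft-shrinkage for
    sparsity, then the inverse FFT back to token space."""
    def __init__(self, dim, sparsity=0.01):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(dim, dim, dtype=torch.cfloat))
        self.sparsity = sparsity

    def forward(self, x):                          # x: (batch, seq, dim), real
        n = x.shape[1]
        freq = torch.fft.rfft(x, dim=1)            # (batch, n//2 + 1, dim)
        freq = freq @ self.weight                  # mix channels per frequency
        freq = torch.complex(F.softshrink(freq.real, self.sparsity),
                             F.softshrink(freq.imag, self.sparsity))
        return torch.fft.irfft(freq, n=n, dim=1)

y = FourierMixer(dim=256)(torch.randn(2, 1024, 256))    # (2, 1024, 256)
```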