Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT
- URL: http://arxiv.org/abs/2203.09055v1
- Date: Thu, 17 Mar 2022 03:33:47 GMT
- Title: Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT
- Authors: Jing Zhao, Yifan Wang, Junwei Bao, Youzheng Wu, Xiaodong He
- Abstract summary: We propose a fine- and coarse-granularity hybrid self-attention that reduces the cost through progressively shortening the computational sequence length in self-attention.
We show that FCA offers a significantly better trade-off between accuracy and FLOPs compared to prior methods.
- Score: 22.904252855587348
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer-based pre-trained models, such as BERT, have shown extraordinary
success in achieving state-of-the-art results in many natural language
processing applications. However, deploying these models can be prohibitively
costly, as the standard self-attention mechanism of the Transformer suffers
from quadratic computational cost in the input sequence length. To confront
this, we propose FCA, a fine- and coarse-granularity hybrid self-attention that
reduces the computation cost through progressively shortening the computational
sequence length in self-attention. Specifically, FCA conducts an
attention-based scoring strategy to determine the informativeness of tokens at
each layer. Then, the informative tokens serve as the fine-granularity
computing units in self-attention and the uninformative tokens are replaced
with one or several clusters as the coarse-granularity computing units in
self-attention. Experiments on GLUE and RACE datasets show that BERT with FCA
achieves 2x reduction in FLOPs over original BERT with <1% loss in accuracy. We
show that FCA offers a significantly better trade-off between accuracy and
FLOPs compared to prior methods.
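As a concrete illustration of the mechanism sketched in the abstract, here is a minimal PyTorch sketch of one hybrid self-attention layer. The scoring rule (total attention each token receives), the top-k split controlled by a keep_ratio parameter, the single mean-pooled cluster, and the single-head formulation are simplifying assumptions for illustration; they are not the paper's exact design.

```python
# Minimal sketch of fine-/coarse-granularity hybrid self-attention (FCA-style).
# Assumptions (not from the paper): importance = attention mass each token
# receives (column sum of the attention matrix), a single mean-pooled cluster
# for uninformative tokens, and one attention head for readability.
import torch
from torch import nn


class HybridSelfAttention(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.keep_ratio = keep_ratio  # fraction of tokens kept at fine granularity
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, dim]
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)

        # Attention-based scoring: how much attention each token receives.
        scores = attn.sum(dim=1)                         # [batch, seq_len]
        n_keep = max(1, int(x.size(1) * self.keep_ratio))
        keep_idx = scores.topk(n_keep, dim=-1).indices   # informative tokens

        # Fine-granularity units: gather the informative tokens.
        batch_idx = torch.arange(x.size(0)).unsqueeze(-1)
        fine = x[batch_idx, keep_idx]                    # [batch, n_keep, dim]

        # Coarse-granularity unit: mean-pool the remaining tokens into one cluster.
        mask = torch.ones_like(scores, dtype=torch.bool)
        mask[batch_idx, keep_idx] = False
        coarse = (x * mask.unsqueeze(-1)).sum(dim=1, keepdim=True) / \
                 mask.sum(dim=1, keepdim=True).clamp(min=1).unsqueeze(-1)

        # Standard self-attention over the shortened sequence (projections reused
        # for brevity; a real layer would have its own parameters per step).
        h = torch.cat([fine, coarse], dim=1)             # [batch, n_keep + 1, dim]
        q2, k2 = self.q(h), self.k(h)
        attn2 = torch.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        return attn2 @ self.v(h)                         # shorter sequence out


# Toy usage: 8 tokens in, keep_ratio=0.5 -> 4 fine tokens + 1 coarse cluster out.
layer = HybridSelfAttention(dim=16, keep_ratio=0.5)
out = layer(torch.randn(2, 8, 16))
print(out.shape)  # torch.Size([2, 5, 16])
```

In a full model this step would be applied layer by layer, so the computational sequence length shrinks progressively as the abstract describes.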
Related papers
- Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z)
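The entry above names its two strategies, token pruning and token combining, without implementation detail. The following is a hypothetical sketch of how the two can be composed: low-importance tokens are removed from the fine set and merged into a single fused token rather than discarded outright. The importance scores are assumed to be given (e.g. attention-derived), and the single fused token is an illustrative choice, not necessarily the paper's.

```python
# Hypothetical sketch: prune low-scoring tokens, then combine them into one
# fused token instead of dropping them (token pruning + token combining).
import torch


def prune_and_combine(x: torch.Tensor, scores: torch.Tensor, keep: int):
    """x: [batch, seq, dim]; scores: [batch, seq] token importance (assumed given).
    Assumes keep < seq so at least one token is pruned."""
    keep_idx = scores.topk(keep, dim=-1).indices                 # tokens to keep
    batch_idx = torch.arange(x.size(0)).unsqueeze(-1)
    kept = x[batch_idx, keep_idx]                                # [batch, keep, dim]

    pruned_mask = torch.ones_like(scores, dtype=torch.bool)
    pruned_mask[batch_idx, keep_idx] = False
    # Combine pruned tokens by importance-weighted averaging into one token.
    w = torch.softmax(scores.masked_fill(~pruned_mask, float("-inf")), dim=-1)
    fused = (w.unsqueeze(-1) * x).sum(dim=1, keepdim=True)       # [batch, 1, dim]
    return torch.cat([kept, fused], dim=1)                       # [batch, keep+1, dim]


x, scores = torch.randn(2, 10, 16), torch.rand(2, 10)
print(prune_and_combine(x, scores, keep=6).shape)  # torch.Size([2, 7, 16])
```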
- Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SecViT, which attains an impressive 84.2% image classification accuracy with only 27M parameters and 4.4G FLOPs.
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
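SEC is described above only by its two properties: clustering by global semantic relevance and keeping clusters balanced. The sketch below shows one way such an equitable split can look, using similarity to the mean token as the relevance measure and an equal-size partition of the ranked tokens; both choices are assumptions for illustration, not SEC's actual algorithm.

```python
# Illustrative sketch: rank tokens by similarity to a global (mean) token and
# split the ranking into equal-sized clusters ("equitable" = balanced sizes).
import torch
import torch.nn.functional as F


def equitable_clusters(x: torch.Tensor, num_clusters: int) -> torch.Tensor:
    """x: [batch, seq, dim] with seq divisible by num_clusters (assumed).
    Returns cluster labels of shape [batch, seq]."""
    global_token = x.mean(dim=1, keepdim=True)                    # [batch, 1, dim]
    # Cosine similarity of each token to the global token.
    relevance = (F.normalize(x, dim=-1) * F.normalize(global_token, dim=-1)).sum(-1)
    order = relevance.argsort(dim=-1, descending=True)            # token ranking
    cluster_size = x.size(1) // num_clusters
    labels = torch.empty_like(order)
    # Consecutive ranks share a cluster, so every cluster has the same size.
    labels.scatter_(1, order, torch.arange(x.size(1)).expand_as(order) // cluster_size)
    return labels


x = torch.randn(2, 12, 16)
labels = equitable_clusters(x, num_clusters=3)
print(labels.shape, labels.flatten().bincount())  # 3 clusters, 4 tokens each per sequence
```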
- TPC-ViT: Token Propagation Controller for Efficient Vision Transformer [6.341420717393898]
Vision transformers (ViTs) have achieved promising results on a variety of computer vision tasks.
Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers.
We propose a novel token propagation controller (TPC) that incorporates two different token-distributions.
arXiv Detail & Related papers (2024-01-03T00:10:33Z)
- DPBERT: Efficient Inference for BERT based on Dynamic Planning [11.680840266488884]
Existing input-adaptive inference methods fail to take full advantage of the structure of BERT.
We propose Dynamic Planning in BERT, a novel fine-tuning strategy that can accelerate the inference process of BERT.
Our method reduces latency to 75% of the original BERT while retaining 98% of its accuracy, yielding a better accuracy-speed trade-off than state-of-the-art input-adaptive methods.
arXiv Detail & Related papers (2023-07-26T07:18:50Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference [18.308180927492643]
ToP is a constraint-aware, ranking-distilled token pruning technique that distills effective token rankings from the final layer of unpruned models to the early layers of pruned models.
ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.
arXiv Detail & Related papers (2023-06-26T03:06:57Z)
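Ranking distillation, transferring which tokens matter from the final layer of an unpruned teacher to early layers of a pruned student, can be expressed as a simple listwise objective. The temperature-scaled softmax plus KL divergence below is an assumed formulation for illustration, not necessarily the loss used in ToP.

```python
# Hypothetical sketch of ranking distillation: align a student layer's token
# importance scores with the teacher's final-layer token ranking via soft
# (temperature-scaled) distributions and a KL divergence.
import torch
import torch.nn.functional as F


def ranking_distill_loss(student_scores, teacher_scores, tau: float = 1.0):
    """Both inputs: [batch, seq] token-importance scores (higher = keep)."""
    log_p_student = F.log_softmax(student_scores / tau, dim=-1)
    p_teacher = F.softmax(teacher_scores / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")


# Toy usage: an early pruned-model layer learns the final-layer ranking.
teacher = torch.randn(4, 128)            # e.g. final-layer attention-derived scores
student = torch.randn(4, 128, requires_grad=True)
loss = ranking_distill_loss(student, teacher)
loss.backward()                          # gradients flow into the student scores
print(float(loss))
```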
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
arXiv Detail & Related papers (2022-09-27T12:05:59Z)
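Token-wise clipping is summarized above only at a high level. The sketch below shows a generic clip-then-quantize step in which the clipping range is derived from per-token activation maxima; the quantile rule and the symmetric 8-bit quantizer are illustrative assumptions, not the paper's coarse-to-fine procedure, and Gamma Migration is not shown.

```python
# Generic illustration: suppress activation outliers by clipping to a range
# derived from per-token maxima before uniform quantization. The quantile rule
# and the symmetric 8-bit quantizer here are illustrative assumptions.
import torch


def clip_and_quantize(x: torch.Tensor, quantile: float = 0.95, bits: int = 8):
    """x: [batch, seq, dim] activations."""
    token_max = x.abs().amax(dim=-1)                  # [batch, seq] per-token peak
    clip_val = torch.quantile(token_max.flatten(), quantile).item()  # ignore extreme tokens
    x_clipped = x.clamp(-clip_val, clip_val)
    # Symmetric uniform quantization to `bits` bits.
    levels = 2 ** (bits - 1) - 1
    scale = clip_val / levels
    return torch.round(x_clipped / scale).clamp(-levels, levels) * scale


x = torch.randn(2, 16, 32)
x[0, 0] *= 50                                         # inject an outlier token
x_q = clip_and_quantize(x)
print(x.abs().max().item(), x_q.abs().max().item())   # outlier range is suppressed
```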
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate fully binarized BERT designed to eliminate the performance bottlenecks of full binarization.
It yields an impressive 56.3x saving in FLOPs and 31.2x saving in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
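Full binarization generally means 1-bit weights and activations. The sign binarizer with a straight-through estimator below is a standard building block of such models, shown as a generic sketch only; it does not reproduce BiBERT's Bi-Attention or distillation scheme.

```python
# Generic 1-bit (sign) binarizer with a straight-through estimator (STE),
# a standard building block of fully binarized Transformers. This is not
# BiBERT's specific design, just the basic idea.
import torch
from torch import nn


class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)                      # +1 / -1 (exact zeros map to 0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # STE: pass gradients through where |x| <= 1, block them elsewhere.
        return grad_out * (x.abs() <= 1).float()


class BinaryLinear(nn.Module):
    def __init__(self, in_f: int, out_f: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)

    def forward(self, x):
        w_b = BinarizeSTE.apply(self.weight)      # 1-bit weights
        x_b = BinarizeSTE.apply(x)                # 1-bit activations
        # Per-layer scaling keeps the output magnitude roughly calibrated.
        alpha = self.weight.abs().mean()
        return torch.nn.functional.linear(x_b, w_b) * alpha


layer = BinaryLinear(16, 8)
out = layer(torch.randn(4, 16))
out.sum().backward()                              # gradients flow via the STE
print(out.shape)                                  # torch.Size([4, 8])
```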
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
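Token mixing in the Fourier domain generally follows the pattern: FFT along the token dimension, a learned per-frequency transform, inverse FFT, which costs O(L log L) rather than O(L^2). The per-mode complex weights below are a simplified, assumed variant; AFNO's block-diagonal MLP, soft-thresholding, and mode truncation are omitted.

```python
# Simplified Fourier-domain token mixer: FFT over the token dimension, multiply
# each frequency mode by learned complex weights, inverse FFT. AFNO's actual
# block-diagonal MLP, soft-thresholding and mode truncation are omitted here.
import torch
from torch import nn


class FourierTokenMixer(nn.Module):
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        n_modes = seq_len // 2 + 1                     # rfft output length
        # One learned complex weight per (frequency mode, channel).
        self.w = nn.Parameter(torch.randn(n_modes, dim, dtype=torch.cfloat) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, dim]; mixing cost is O(L log L) instead of O(L^2).
        x_f = torch.fft.rfft(x, dim=1)                 # [batch, n_modes, dim], complex
        x_f = x_f * self.w                             # mix tokens per frequency mode
        return torch.fft.irfft(x_f, n=x.size(1), dim=1)


mixer = FourierTokenMixer(seq_len=1024, dim=64)
print(mixer(torch.randn(2, 1024, 64)).shape)           # torch.Size([2, 1024, 64])
```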
This list is automatically generated from the titles and abstracts of the papers in this site.