Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT
- URL: http://arxiv.org/abs/2203.09055v1
- Date: Thu, 17 Mar 2022 03:33:47 GMT
- Title: Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT
- Authors: Jing Zhao, Yifan Wang, Junwei Bao, Youzheng Wu, Xiaodong He
- Abstract summary: We propose a fine- and coarse-granularity hybrid self-attention that reduces the cost through progressively shortening the computational sequence length in self-attention.
We show that FCA offers a significantly better trade-off between accuracy and FLOPs compared to prior methods.
- Score: 22.904252855587348
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer-based pre-trained models, such as BERT, have shown extraordinary
success in achieving state-of-the-art results in many natural language
processing applications. However, deploying these models can be prohibitively
costly, as the standard self-attention mechanism of the Transformer suffers
from quadratic computational cost in the input sequence length. To confront
this, we propose FCA, a fine- and coarse-granularity hybrid self-attention that
reduces the computation cost through progressively shortening the computational
sequence length in self-attention. Specifically, FCA conducts an
attention-based scoring strategy to determine the informativeness of tokens at
each layer. Then, the informative tokens serve as the fine-granularity
computing units in self-attention and the uninformative tokens are replaced
with one or several clusters as the coarse-granularity computing units in
self-attention. Experiments on GLUE and RACE datasets show that BERT with FCA
achieves 2x reduction in FLOPs over original BERT with <1% loss in accuracy. We
show that FCA offers a significantly better trade-off between accuracy and
FLOPs compared to prior methods.
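As a rough illustration of the layer-wise shortening described in the abstract, the sketch below keeps the top-scoring tokens as fine-granularity units and mean-pools the rest into a single coarse-granularity cluster. This is a minimal PyTorch sketch, not the authors' implementation: using the attention received from [CLS] as the informativeness score, the fixed keep ratio, and the single-cluster pooling are simplifying assumptions.

```python
# Minimal sketch of the fine/coarse shortening step (not the authors' code).
# Assumptions: informativeness = attention received from [CLS], the top-k tokens
# are kept as fine-granularity units, and the remaining tokens are mean-pooled
# into a single coarse-granularity cluster.
import torch

def fca_shorten(hidden, attn_probs, keep_ratio=0.5):
    """hidden: (B, L, H) layer outputs; attn_probs: (B, heads, L, L) attention weights."""
    B, L, H = hidden.shape
    scores = attn_probs[:, :, 0, :].mean(dim=1)            # (B, L) attention from [CLS]
    k = max(1, int(L * keep_ratio))
    topk = scores.topk(k, dim=-1).indices                  # indices of informative tokens
    keep = torch.zeros(B, L, dtype=torch.bool, device=hidden.device)
    keep.scatter_(1, topk, True)

    fine = hidden[keep].view(B, k, H)                       # fine-granularity units
    coarse = hidden[~keep].view(B, L - k, H).mean(1, keepdim=True)  # one coarse cluster
    return torch.cat([fine, coarse], dim=1)                 # (B, k + 1, H) shorter sequence
```

Applied between layers, each subsequent self-attention then runs over k + 1 units instead of the full L tokens, which is where the FLOPs reduction comes from.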
Related papers
- Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining.
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z)
- TPC-ViT: Token Propagation Controller for Efficient Vision Transformer [6.341420717393898]
Vision transformers (ViTs) have achieved promising results on a variety of Computer Vision tasks.
Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers.
We propose a novel token propagation controller (TPC) that incorporates two different token-distributions.
arXiv Detail & Related papers (2024-01-03T00:10:33Z)
- DPBERT: Efficient Inference for BERT based on Dynamic Planning [11.680840266488884]
Existing input-adaptive inference methods fail to take full advantage of the structure of BERT.
We propose Dynamic Planning in BERT, a novel fine-tuning strategy that can accelerate the inference process of BERT.
Our method reduces latency to 75% while maintaining 98% accuracy, yielding a better accuracy-speed trade-off compared to state-of-the-art input-adaptive methods.
arXiv Detail & Related papers (2023-07-26T07:18:50Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference [18.308180927492643]
ToP is a ranking-distilled token pruning technique, which distills effective token rankings from the final layer of unpruned models to the early layers of pruned models.
ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.
arXiv Detail & Related papers (2023-06-26T03:06:57Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
arXiv Detail & Related papers (2022-09-27T12:05:59Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT designed to eliminate the performance bottlenecks of full binarization.
The method yields an impressive 56.3x saving in FLOPs and a 31.2x saving in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence length of 65k and outperforms other efficient self-attention mechanisms (a rough sketch of Fourier-domain token mixing follows this list).
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is to capture the co-evolutionary information reflected by the inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
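For the Fourier-domain token mixing mentioned in the AFNO entry above, here is a simplified sketch. It is not the paper's method: AFNO applies an adaptive block-diagonal MLP with soft-thresholding on the Fourier modes, whereas this version only learns a fixed complex filter per mode and channel (closer to a global-filter layer); it is meant only to show how mixing along the token dimension can move into the frequency domain.

```python
# Simplified Fourier-domain token mixer (illustrative only; see caveats above).
import torch
import torch.nn as nn

class FourierTokenMixer(nn.Module):
    def __init__(self, seq_len: int, hidden: int):
        super().__init__()
        n_modes = seq_len // 2 + 1                       # length of the rfft output
        self.w_re = nn.Parameter(torch.ones(n_modes, hidden))   # learned filter, real part
        self.w_im = nn.Parameter(torch.zeros(n_modes, hidden))  # learned filter, imaginary part

    def forward(self, x):                                # x: (batch, seq_len, hidden)
        f = torch.fft.rfft(x, dim=1)                     # mix along the token dimension
        f = f * torch.complex(self.w_re, self.w_im)      # per-mode, per-channel filtering
        return torch.fft.irfft(f, n=x.size(1), dim=1)    # back to token space

# Example: mixer = FourierTokenMixer(seq_len=128, hidden=768); y = mixer(torch.randn(2, 128, 768))
```

The O(L log L) cost of the FFT is what lets this style of mixer scale to the long sequences mentioned in the AFNO entry.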
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.