Learned Token Pruning for Transformers
- URL: http://arxiv.org/abs/2107.00910v1
- Date: Fri, 2 Jul 2021 09:00:13 GMT
- Title: Learned Token Pruning for Transformers
- Authors: Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Joseph Hassoun,
Kurt Keutzer
- Abstract summary: Learned Token Pruning () method reduces redundant tokens as the data passes through the different layers of a transformer.
We extensively test the performance of our approach on multiple GLUE tasks.
Preliminary results show up to 1.4x and 1.9x throughput improvement on Tesla T4 and Intel Haswell.
- Score: 39.181816379061374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A major challenge in deploying transformer models is their prohibitive
inference cost, which quadratically scales with the input sequence length. This
makes it especially difficult to use transformers for processing long
sequences. To address this, we present a novel Learned Token Pruning (LTP)
method that reduces redundant tokens as the data passes through the different
layers of the transformer. In particular, LTP prunes tokens with an attention
score below a threshold value, which is learned during training. Importantly,
our threshold based method avoids algorithmically expensive operations such as
top-k token selection which are used in prior token pruning methods, and also
leads to structured pruning. We extensively test the performance of our
approach on multiple GLUE tasks and show that our learned threshold based
method consistently outperforms the prior state-of-the-art top-k token based
method by up to ~2% higher accuracy with the same amount of FLOPs. Furthermore,
our preliminary results show up to 1.4x and 1.9x throughput improvement on
Tesla T4 GPU and Intel Haswell CPU, respectively, with less than 1% of accuracy
drop (and up to 2.1x FLOPs reduction). Our code has been developed in PyTorch
and has been open-sourced.
Related papers
- Learned Thresholds Token Merging and Pruning for Vision Transformers [5.141687309207561]
This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning.
We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task.
arXiv Detail & Related papers (2023-07-20T11:30:12Z) - Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing
Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z) - Joint Token Pruning and Squeezing Towards More Aggressive Compression of
Vision Transformers [2.0442992958844517]
We propose a novel Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency.
TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-based fusing steps.
Our method can accelerate the throughput of DeiT-small beyond DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%.
arXiv Detail & Related papers (2023-04-21T02:59:30Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully
Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with a minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - Transkimmer: Transformer Learns to Layer-wise Skim [17.188613474427054]
One of the major computational inefficiency of Transformer-based models is that they spend identical amount of computation throughout all layers.
We propose Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer.
The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers.
arXiv Detail & Related papers (2022-05-15T16:23:30Z) - A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE)
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT)
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that for a small drop of accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient
Language Processing [112.2208052057002]
We propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.