Transkimmer: Transformer Learns to Layer-wise Skim
- URL: http://arxiv.org/abs/2205.07324v1
- Date: Sun, 15 May 2022 16:23:30 GMT
- Title: Transkimmer: Transformer Learns to Layer-wise Skim
- Authors: Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo
- Abstract summary: One of the major computational inefficiencies of Transformer-based models is that they spend an identical amount of computation throughout all layers.
We propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer.
The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers.
- Score: 17.188613474427054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has become the de facto model for many
machine learning tasks, from natural language processing to computer vision. As
such, improving its computational efficiency becomes paramount. One of the major
computational inefficiencies of Transformer-based models is that they spend an
identical amount of computation throughout all layers. Prior works have
proposed to augment the Transformer model with the capability of skimming
tokens to improve its computational efficiency. However, they suffer from the
lack of effective, end-to-end optimization of the discrete skimming predictor.
To address this limitation, we propose the Transkimmer
architecture, which learns to identify hidden state tokens that are not
required by each layer. The skimmed tokens are then forwarded directly to the
final output, thus reducing the computation of the successive layers. The key
idea in Transkimmer is to add a parameterized predictor before each layer that
learns to make the skimming decision. We also propose to adopt
the reparameterization trick and add a skim loss for the end-to-end training of
Transkimmer. Transkimmer achieves a 10.97x average speedup on the GLUE benchmark
compared with the vanilla BERT-base baseline, with less than 1% accuracy
degradation.
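As a rough illustration of the mechanism described above, the following PyTorch sketch places a small predictor before an encoder layer, samples a hard keep/skim decision per token with the Gumbel-Softmax reparameterization, and exposes a skim loss term. The class names, predictor shape, and masking-based bypass are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkimPredictor(nn.Module):
    """Per-token binary gate placed before a Transformer layer (illustrative)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
            nn.Linear(hidden_size // 4, 2),  # logits for [skim, keep]
        )

    def forward(self, hidden_states, tau=1.0):
        logits = self.mlp(hidden_states)
        # Gumbel-Softmax keeps the discrete skimming decision differentiable.
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)
        return gate[..., 1]  # 1.0 = keep the token, 0.0 = skim it

class TranskimmerLayer(nn.Module):
    """Wraps one encoder layer with a skim gate (sketch, not the reference code)."""
    def __init__(self, layer, hidden_size):
        super().__init__()
        self.layer = layer
        self.predictor = SkimPredictor(hidden_size)

    def forward(self, hidden_states):
        keep = self.predictor(hidden_states)        # (batch, seq_len)
        out = self.layer(hidden_states)
        # Skimmed tokens bypass the layer: their hidden states pass through unchanged.
        mask = keep.unsqueeze(-1)
        hidden_states = mask * out + (1.0 - mask) * hidden_states
        # Skim loss term: fraction of kept tokens, encouraging more skimming.
        skim_loss = keep.mean()
        return hidden_states, skim_loss

# Example: wrap a standard encoder layer; skimmed tokens skip its computation
# (via masking here; at inference they can be gathered out entirely for speedup).
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
skim_layer = TranskimmerLayer(layer, hidden_size=768)
hidden, skim_loss = skim_layer(torch.randn(2, 128, 768))
```

The full training objective would then combine the task loss with a weighted sum of the per-layer skim losses, trading accuracy against speedup; the paper's exact loss formulation and inference-time token gathering are not reproduced here.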
Related papers
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
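The basic building block such decomposition frameworks search over is a low-rank factorization of a dense weight matrix. The sketch below shows one candidate, a truncated-SVD factorization of a linear layer; the helper name, the fixed rank, and the omission of HEAT's hardware cost model and search loop are simplifying assumptions.

```python
import torch
import torch.nn as nn

def factorize_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a dense layer with a rank-`rank` factorization via truncated SVD.
    A generic low-rank candidate; HEAT's search space and hardware cost model
    are not reproduced here."""
    W = linear.weight.data              # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]        # (out_features, rank)
    V_r = Vh[:rank, :]                  # (rank, in_features)

    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    first.weight.data.copy_(V_r)
    second.weight.data.copy_(U_r)
    if linear.bias is not None:
        second.bias.data.copy_(linear.bias.data)
    return nn.Sequential(first, second)

# Example: compress a BERT-sized feed-forward projection to rank 128.
dense = nn.Linear(768, 3072)
compressed = factorize_linear(dense, rank=128)
```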
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Block-Skim: Efficient Question Answering for Transformer [25.429122678247452]
We propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate Transformer performance.
We further prune the hidden states corresponding to the unnecessary positions early in lower layers, achieving significant inference-time speedup.
Block-Skim improves QA models' accuracy on different datasets and achieves a 3x speedup on the BERT-base model.
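A minimal sketch of the block-level skimming idea, assuming fixed-size context blocks scored by a small classifier; Block-Skim itself derives block relevance from attention maps and trains the gate with auxiliary supervision, which is not reproduced here.

```python
import torch
import torch.nn as nn

class BlockSkimGate(nn.Module):
    """Score fixed-size context blocks and drop those predicted irrelevant (illustrative)."""
    def __init__(self, hidden_size, block_size=32):
        super().__init__()
        self.block_size = block_size
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, threshold=0.5):
        b, seq_len, h = hidden_states.shape
        blocks = hidden_states.view(b, seq_len // self.block_size, self.block_size, h)
        scores = torch.sigmoid(self.scorer(blocks.mean(dim=2))).squeeze(-1)  # (b, n_blocks)
        keep = scores > threshold  # at inference, pruned blocks are simply removed
        return keep

# Example: which 32-token blocks of a 512-token passage survive to the upper layers.
gate = BlockSkimGate(hidden_size=768)
keep_mask = gate(torch.randn(2, 512, 768))
```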
arXiv Detail & Related papers (2021-12-16T01:45:33Z) - Token Pooling in Vision Transformers [37.11990688046186]
In vision transformers, self-attention is not the major computation bottleneck; for example, more than 80% of the computation is spent on fully-connected layers.
We propose a novel token downsampling method, called Token Pooling, that efficiently exploits redundancies in the images and intermediate token representations.
Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over state-of-the-art downsampling methods.
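The following sketch illustrates clustering-style token downsampling under the assumption that redundant tokens can be merged into cluster centers; the function name and the plain k-means procedure are stand-ins, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def token_pool(tokens: torch.Tensor, num_out: int, iters: int = 10) -> torch.Tensor:
    """Reduce (batch, n, d) tokens to (batch, num_out, d) by k-means-style pooling."""
    b, n, d = tokens.shape
    # Initialize cluster centers from a strided subset of the input tokens.
    idx = torch.linspace(0, n - 1, num_out).long()
    centers = tokens[:, idx, :].clone()
    for _ in range(iters):
        # Assign every token to its nearest center, then recompute the centers as means.
        assign = torch.cdist(tokens, centers).argmin(dim=-1)        # (b, n)
        one_hot = F.one_hot(assign, num_out).float()                # (b, n, num_out)
        counts = one_hot.sum(dim=1).clamp(min=1.0)                  # (b, num_out)
        centers = torch.einsum("bnk,bnd->bkd", one_hot, tokens) / counts.unsqueeze(-1)
    return centers

# Example: halve 196 patch tokens to 98 between transformer stages.
pooled = token_pool(torch.randn(4, 196, 384), num_out=98)
```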
arXiv Detail & Related papers (2021-10-08T02:22:50Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
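The recurrent counterpart relies on linear (kernelized) attention, which can be evaluated step by step with a constant-size state. The sketch below uses a simple elu+1 feature map as an assumed stand-in; the actual conversion finetunes a learned feature map, which is not shown.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # A simple positive feature map; the paper instead learns a small MLP during finetuning.
    return F.elu(x) + 1.0

def linear_attention_rnn(q, k, v):
    """Autoregressive linear attention run as an RNN over time steps.
    q, k, v: (seq_len, d). Returns (seq_len, d) with constant memory per step."""
    d = q.shape[-1]
    S = torch.zeros(d, d)      # running sum of phi(k) v^T
    z = torch.zeros(d)         # running sum of phi(k), used for normalization
    outputs = []
    for t in range(q.shape[0]):
        phi_k = feature_map(k[t])
        phi_q = feature_map(q[t])
        S = S + torch.outer(phi_k, v[t])
        z = z + phi_k
        outputs.append((phi_q @ S) / (phi_q @ z + 1e-6))
    return torch.stack(outputs)

# Example: 16 steps of single-head attention with 64-dimensional states.
out = linear_attention_rnn(*(torch.randn(16, 64) for _ in range(3)))
```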
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - GradInit: Learning to Initialize Neural Networks for Stable and
Efficient Training [59.160154997555956]
We present GradInit, an automated and architecture-agnostic method for initializing neural networks.
It is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value.
It also enables training the original Post-LN Transformer for machine translation without learning rate warmup.
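A simplified sketch of the idea, assuming a toy two-layer classifier and one multiplicative scale per parameter tensor: the scales are optimized so that the loss after one simulated SGD step is as small as possible. GradInit's norm constraints, clamping, and Adam variant are omitted.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy two-layer classifier and data (hypothetical stand-ins for a real model).
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))
init = [torch.randn(64, 32) * 0.1, torch.zeros(64),
        torch.randn(10, 64) * 0.1, torch.zeros(10)]

def loss_with(params):
    w1, b1, w2, b2 = params
    h = torch.relu(x @ w1.t() + b1)
    return F.cross_entropy(h @ w2.t() + b2, y)

# One learnable scale per parameter tensor; the initial weights themselves stay fixed.
scales = [torch.ones((), requires_grad=True) for _ in init]
meta_opt = torch.optim.Adam(scales, lr=1e-2)
inner_lr = 0.1

for _ in range(200):
    scaled = [s * w for s, w in zip(scales, init)]
    # Simulate a single SGD step from the rescaled initialization...
    grads = torch.autograd.grad(loss_with(scaled), scaled, create_graph=True)
    stepped = [p - inner_lr * g for p, g in zip(scaled, grads)]
    # ...then adjust the scales so that the post-step loss is as small as possible.
    post_step_loss = loss_with(stepped)
    meta_opt.zero_grad()
    post_step_loss.backward()
    meta_opt.step()
```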
arXiv Detail & Related papers (2021-02-16T11:45:35Z) - Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime
with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
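A toy sketch of the search stage, assuming a hypothetical evaluate function that would, in practice, run the once-trained model on a development set; the mutation scheme and fitness ranking are illustrative, not the paper's exact evolutionary algorithm.

```python
import random

NUM_LAYERS, SEQ_LEN = 12, 384  # assumed BERT-base-like setup

def evaluate(length_config):
    """Hypothetical stand-in: returns (accuracy, flops-proxy) for a per-layer token budget."""
    flops = sum(length_config)                     # cost grows with the tokens kept per layer
    accuracy = 1.0 - 0.001 * (SEQ_LEN * NUM_LAYERS - flops) ** 0.5
    return accuracy, flops

def mutate(config):
    # Shrink one layer's token budget and keep the lengths monotonically non-increasing.
    i = random.randrange(NUM_LAYERS)
    new = list(config)
    new[i] = max(16, new[i] - random.choice([16, 32, 64]))
    for j in range(i + 1, NUM_LAYERS):
        new[j] = min(new[j], new[i])
    return tuple(new)

def search(budget, generations=60, population=20):
    pop = [(SEQ_LEN,) * NUM_LAYERS]                # start from the full-length configuration

    def fitness(c):
        acc, fl = evaluate(c)
        # Rank feasible configs by accuracy, infeasible ones by how far over budget they are.
        return (acc, 0.0) if fl <= budget else (float("-inf"), -(fl - budget))

    for _ in range(generations):
        candidates = set(pop) | {mutate(c) for c in pop for _ in range(4)}
        pop = sorted(candidates, key=fitness, reverse=True)[:population]
    return pop[0]

best_config = search(budget=NUM_LAYERS * SEQ_LEN // 2)
```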
arXiv Detail & Related papers (2020-10-14T12:28:08Z) - Funnel-Transformer: Filtering out Sequential Redundancy for Efficient
Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
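A minimal sketch of the compression pattern, assuming plain strided average pooling after each encoder block; the real Funnel-Transformer pools only the attention queries at the stage boundary and adds a decoder to recover per-token outputs.

```python
import torch
import torch.nn as nn

class FunnelBlock(nn.Module):
    """Encoder layer followed by strided pooling of the token sequence (illustrative)."""
    def __init__(self, hidden_size, num_heads=8, stride=2):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, hidden_states):                  # (batch, seq_len, hidden)
        hidden_states = self.encoder(hidden_states)
        # Halve the sequence length: (batch, seq, h) -> (batch, seq // stride, h).
        return self.pool(hidden_states.transpose(1, 2)).transpose(1, 2)

# Example: 512 tokens shrink to 128 across two funnel stages.
blocks = nn.Sequential(FunnelBlock(768), FunnelBlock(768))
out = blocks(torch.randn(2, 512, 768))   # -> (2, 128, 768)
```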
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.