DeFormer: Decomposing Pre-trained Transformers for Faster Question
Answering
- URL: http://arxiv.org/abs/2005.00697v1
- Date: Sat, 2 May 2020 04:28:22 GMT
- Title: DeFormer: Decomposing Pre-trained Transformers for Faster Question
Answering
- Authors: Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, Niranjan
Balasubramanian
- Abstract summary: Transformer-based QA models use input-wide self-attention across both the question and the input passage.
We introduce DeFormer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers.
We show DeFormer versions of BERT and XLNet can be used to speed up QA by over 4.3x, and with simple distillation-based losses they incur only a 1% drop in accuracy.
- Score: 22.178201429268103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based QA models use input-wide self-attention -- i.e. across both
the question and the input passage -- at all layers, causing them to be slow
and memory-intensive. It turns out that we can get by without input-wide
self-attention at all layers, especially in the lower layers. We introduce
DeFormer, a decomposed transformer, which substitutes the full self-attention
with question-wide and passage-wide self-attentions in the lower layers. This
allows for question-independent processing of the input text representations,
which in turn enables pre-computing passage representations, drastically reducing
runtime compute. Furthermore, because DeFormer is largely similar to the
original model, we can initialize DeFormer with the pre-training weights of a
standard transformer, and directly fine-tune on the target QA dataset. We show
DeFormer versions of BERT and XLNet can be used to speed up QA by over 4.3x, and
with simple distillation-based losses they incur only a 1% drop in accuracy. We
open source the code at https://github.com/StonyBrookNLP/deformer.
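The decomposition described above lends itself to a short sketch: lower layers attend within the question and within the passage separately, so passage representations can be pre-computed and cached, and only the upper layers attend over the joined sequence. The following is a minimal illustration built from generic PyTorch encoder layers, not the released DeFormer code; `DecomposedEncoder`, `split_at`, and `encode_lower` are assumed names for illustration.
```python
# Minimal sketch of the decomposition idea using generic PyTorch encoder
# layers -- NOT the released DeFormer code; names are illustrative only.
import torch
import torch.nn as nn

class DecomposedEncoder(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_layers=12, split_at=9):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Lower layers: question-wide / passage-wide attention only.
        self.lower = nn.ModuleList([make() for _ in range(split_at)])
        # Upper layers: full self-attention over the joined sequence.
        self.upper = nn.ModuleList([make() for _ in range(num_layers - split_at)])

    def encode_lower(self, x):
        # Question-independent processing: for passages this can be run
        # offline once and the result cached.
        for layer in self.lower:
            x = layer(x)
        return x

    def forward(self, question, passage, cached_passage=None):
        q = self.encode_lower(question)
        p = cached_passage if cached_passage is not None else self.encode_lower(passage)
        x = torch.cat([q, p], dim=1)  # join sequences only for the upper layers
        for layer in self.upper:
            x = layer(x)
        return x

enc = DecomposedEncoder().eval()
question = torch.randn(1, 16, 768)   # (batch, question_len, hidden)
passage = torch.randn(1, 200, 768)   # (batch, passage_len, hidden)
with torch.no_grad():
    cached = enc.encode_lower(passage)              # pre-compute once per passage
    out = enc(question, passage, cached_passage=cached)
```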
Related papers
- Value Residual Learning For Alleviating Attention Concentration In Transformers [14.898656879574622]
Stacking multiple attention layers leads to attention concentration.
One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers.
We propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers.
arXiv Detail & Related papers (2024-10-23T14:15:07Z)
- IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs [8.830921747658925]
One limitation of existing Transformer-based models is that they cannot handle very long sequences as input.
We propose a novel method for accelerating self-attention at inference time.
We demonstrate a greater speedup of 2.73x - 7.63x while retaining 98.6% - 99.6% of the accuracy of the original pretrained models.
arXiv Detail & Related papers (2024-05-05T08:18:42Z)
- MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is the key component of the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z)
- Unlimiformer: Long-Range Transformers with Unlimited Length Input [67.04942180004805]
Unlimiformer is a general approach that wraps any existing pretrained encoder-decoder transformer.
It offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index.
We show that Unlimiformer can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time.
arXiv Detail & Related papers (2023-05-02T17:35:08Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Block-Skim: Efficient Question Answering for Transformer [25.429122678247452]
We propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate Transformer performance.
We further prune the hidden states corresponding to the unnecessary positions early in lower layers, achieving significant inference-time speedup.
Block-Skim improves QA models' accuracy on different datasets and achieves a 3x speedup on the BERT-base model.
arXiv Detail & Related papers (2021-12-16T01:45:33Z)
- Memory-efficient Transformers via Top-$k$ Attention [23.672065688109395]
In this work, we propose a simple yet highly accurate approximation for vanilla attention.
We process the queries in chunks and, for each query, compute the top-$k$ scores with respect to the keys (a minimal sketch appears after this list).
We show our approach leads to accuracy that is nearly-identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
arXiv Detail & Related papers (2021-06-13T02:30:23Z)
- Video Super-Resolution Transformer [85.11270760456826]
Video super-resolution (VSR), which aims to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem.
Recently, the Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling.
In this paper, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding of how to exploit locality information.
arXiv Detail & Related papers (2021-06-12T20:00:32Z)
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
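To make the chunked top-$k$ attention idea from the "Memory-efficient Transformers via Top-$k$ Attention" entry above concrete, here is the minimal sketch referenced in that entry. The function name `topk_attention` and the `topk`/`chunk` defaults are assumptions for illustration, not code from that paper.
```python
# Minimal sketch of chunked top-k attention, assuming PyTorch; the name
# topk_attention and the default topk/chunk values are illustrative only.
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, topk=32, chunk=128):
    """Process queries in chunks; for each query keep only its top-k key scores."""
    outputs = []
    scale = q.shape[-1] ** -0.5
    for start in range(0, q.shape[-2], chunk):
        qc = q[..., start:start + chunk, :]
        scores = qc @ k.transpose(-2, -1) * scale            # (..., chunk, n_keys)
        vals, idx = scores.topk(min(topk, scores.shape[-1]), dim=-1)
        # Keep only the top-k scores; the rest are masked out before softmax.
        sparse = torch.full_like(scores, float("-inf")).scatter(-1, idx, vals)
        outputs.append(F.softmax(sparse, dim=-1) @ v)
    return torch.cat(outputs, dim=-2)

q = k = v = torch.randn(1, 4, 1024, 64)  # (batch, heads, seq_len, head_dim)
out = topk_attention(q, k, v, topk=32, chunk=256)
```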
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.