Finetuning Pretrained Transformers into RNNs
- URL: http://arxiv.org/abs/2103.13076v1
- Date: Wed, 24 Mar 2021 10:50:43 GMT
- Title: Finetuning Pretrained Transformers into RNNs
- Authors: Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco,
Nikolaos Pappas, Yi Mao, Weizhu Chen, Noah A. Smith
- Abstract summary: Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
- Score: 81.72974646901136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have outperformed recurrent neural networks (RNNs) in natural
language generation. This comes with a significant computational overhead, as
the attention mechanism scales with a quadratic complexity in sequence length.
Efficient transformer variants have received increasing interest from recent
works. Among them, a linear-complexity recurrent variant has proven well suited
for autoregressive generation. It approximates the softmax attention with
randomized or heuristic feature maps, but can be difficult to train or yield
suboptimal accuracy. This work aims to convert a pretrained transformer into
its efficient recurrent counterpart, improving the efficiency while retaining
the accuracy. Specifically, we propose a swap-then-finetune procedure: in an
off-the-shelf pretrained transformer, we replace the softmax attention with its
linear-complexity recurrent alternative and then finetune. With a learned
feature map, our approach provides an improved tradeoff between efficiency and
accuracy over the standard transformer and other recurrent variants. We also
show that the finetuning process has a lower training cost than training these
recurrent variants from scratch. As many recent models for natural language
tasks are increasingly dependent on large-scale pretrained transformers, this
work presents a viable approach to improving inference efficiency without
repeating the expensive pretraining process.
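To make the swap-then-finetune idea concrete, below is a minimal, hedged sketch of the kind of linear-complexity attention with a learned feature map that could replace softmax attention in a pretrained transformer. This is not the authors' released code; PyTorch, the module names, and the feature dimension are assumptions made for illustration.

```python
# A minimal sketch (not the authors' released code) of swapping softmax attention
# for a linear-complexity variant with a *learned* feature map, as described in the
# abstract above. PyTorch, the module names, and the feature dimension are assumptions.
import torch
import torch.nn as nn


class LearnedFeatureMap(nn.Module):
    """Small learned map phi(.) applied to queries and keys."""

    def __init__(self, head_dim: int, feature_dim: int):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps features non-negative so the attention normalizer stays positive.
        return torch.relu(self.proj(x))


class LinearAttention(nn.Module):
    """Drop-in replacement for one softmax attention head: softmax(QK^T)V is
    approximated by phi(Q)(phi(K)^T V), normalized by phi(Q) sum_t phi(k_t)."""

    def __init__(self, head_dim: int, feature_dim: int = 32, eps: float = 1e-6):
        super().__init__()
        self.phi = LearnedFeatureMap(head_dim, feature_dim)
        self.eps = eps

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, head_dim); causal masking omitted for brevity.
        q_f, k_f = self.phi(q), self.phi(k)                  # (B, T, F)
        kv = torch.einsum("btf,btd->bfd", k_f, v)            # sum_t phi(k_t) v_t^T
        z = k_f.sum(dim=1)                                   # sum_t phi(k_t)
        num = torch.einsum("btf,bfd->btd", q_f, kv)
        den = torch.einsum("btf,bf->bt", q_f, z).unsqueeze(-1) + self.eps
        return num / den                                     # (B, T, head_dim)

    def init_state(self, batch_size: int, head_dim: int, device=None):
        f = self.phi.proj.out_features
        return (torch.zeros(batch_size, f, head_dim, device=device),
                torch.zeros(batch_size, f, device=device))

    def step(self, q_t, k_t, v_t, state):
        # Recurrent form for autoregressive generation: O(1) time and memory per token.
        kv, z = state
        q_f, k_f = self.phi(q_t), self.phi(k_t)              # (B, F)
        kv = kv + torch.einsum("bf,bd->bfd", k_f, v_t)       # running sum of phi(k) v^T
        z = z + k_f                                          # running normalizer
        out = torch.einsum("bf,bfd->bd", q_f, kv) / (
            torch.einsum("bf,bf->b", q_f, z).unsqueeze(-1) + self.eps
        )
        return out, (kv, z)
```

In the swap-then-finetune recipe, a module like this would replace the softmax attention in each head of an off-the-shelf pretrained transformer (keeping the pretrained Q/K/V projections), and the network would then be briefly finetuned so the feature-map parameters adapt; `step` gives the recurrent, constant-cost-per-token form used at generation time.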
Related papers
- SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling [1.0128808054306186]
We propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method.
Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training.
SPION achieves up to a 3.08X speedup over existing state-of-the-art sparse Transformer models.
arXiv Detail & Related papers (2023-09-22T02:14:46Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches; this is the first time such a result has been demonstrated.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization [31.28396970291575]
Efficient transformers leveraging techniques such as sparse and linear attention and hashing tricks have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade accuracy.
We first interpret the linear attention and residual connections in computing the attention map as gradient descent steps.
We then introduce momentum into these components and propose the momentum transformer, which uses momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexity (a brief sketch follows this entry).
arXiv Detail & Related papers (2022-08-01T02:37:49Z)
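For illustration only, here is one way momentum could be folded into the linear-attention recurrence sketched earlier: a generic heavy-ball update on the running key-value state, with assumed hyperparameters beta and gamma. It is in the spirit of the momentum transformer summarized above, not the paper's exact formulation.

```python
# Illustrative heavy-ball momentum added to the linear-attention recurrence;
# the momentum transformer's exact update rule may differ. beta and gamma are
# hypothetical hyperparameters introduced here for the sketch.
import torch


def momentum_linear_attention_step(q_f, k_f, v_t, state, beta=0.9, gamma=1.0, eps=1e-6):
    """One decoding step. q_f, k_f: (B, F) feature-mapped query/key; v_t: (B, D) value."""
    s, z, m = state                                    # running KV sum, normalizer, momentum
    grad = torch.einsum("bf,bd->bfd", k_f, v_t)        # "gradient-like" increment phi(k) v^T
    m = beta * m + grad                                # accumulate momentum across steps
    s = s + gamma * m                                  # heavy-ball update of the KV state
    z = z + k_f
    out = torch.einsum("bf,bfd->bd", q_f, s) / (
        torch.einsum("bf,bf->b", q_f, z).unsqueeze(-1) + eps
    )
    return out, (s, z, m)
```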
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
- Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU [19.103130032967663]
Incremental processing allows interactive systems to respond based on partial inputs.
Recent work attempts to apply Transformers incrementally via restart-incrementality.
This approach is computationally costly and does not scale efficiently for long sequences.
arXiv Detail & Related papers (2021-09-15T15:20:29Z)
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)