Shortformer: Better Language Modeling using Shorter Inputs
- URL: http://arxiv.org/abs/2012.15832v1
- Date: Thu, 31 Dec 2020 18:52:59 GMT
- Title: Shortformer: Better Language Modeling using Shorter Inputs
- Authors: Ofir Press, Noah A. Smith, Mike Lewis
- Abstract summary: We show that initially training the model on short subsequences, before moving on to longer ones, both reduces overall training time and improves perplexity.
We then show how to improve the efficiency of recurrence methods in transformers.
- Score: 62.51758040848735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the benefits of decreasing the input length of transformers.
First, we show that initially training the model on short subsequences, before
moving on to longer ones, both reduces overall training time and, surprisingly,
gives a large improvement in perplexity. We then show how to improve the
efficiency of recurrence methods in transformers, which let models condition on
previously processed tokens (when generating sequences that are larger than the
maximal length that the transformer can handle at once). Existing methods
require computationally expensive relative position embeddings; we introduce a
simple alternative of adding absolute position embeddings to queries and keys
instead of to word embeddings, which efficiently produces superior results. By
combining these techniques, we increase training speed by 65%, make generation
nine times faster, and substantially improve perplexity on WikiText-103,
without adding any parameters.
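As a concrete illustration of the second technique described in the abstract, adding absolute position embeddings to the queries and keys rather than to the word embeddings, here is a minimal PyTorch sketch. It follows only what the abstract states; the class name, head splitting, causal mask, and max_len limit are illustrative assumptions, not the authors' released implementation.
```python
# Minimal sketch (assumptions noted above): absolute position embeddings are
# added to queries and keys inside attention, not to the word embeddings, so
# values and layer outputs stay position-free.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionInfusedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_len: int = 512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # absolute positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token representations without positions
        bsz, seq_len, d_model = x.shape
        pos = self.pos_emb(torch.arange(seq_len, device=x.device))
        q = self.q_proj(x + pos)   # positions infused into queries ...
        k = self.k_proj(x + pos)   # ... and keys,
        v = self.v_proj(x)         # but not into values.
        split = lambda t: t.view(bsz, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        out = (F.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(bsz, seq_len, d_model)
        return self.out_proj(out)
```
The staged-training idea from the first part of the abstract is orthogonal to this and amounts to training on short subsequences first, then switching to the full input length partway through training.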
Related papers
- TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model [14.846377138993645]
Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers.
A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise.
We propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy.
arXiv Detail & Related papers (2023-05-18T09:58:19Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
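The summary above only gestures at the idea of projecting a long source sequence into a smaller, fixed-size representation. The toy PyTorch module below illustrates that general notion with a learned projection over positions; the slot count, parameterization, and names are assumptions, not MemSizer's actual construction.
```python
# Toy sketch of compressing a length-n sequence into a fixed number of memory
# slots, so downstream attention cost no longer grows quadratically with n.
# (Illustrative only; see the paper for MemSizer's actual formulation.)
import torch
import torch.nn as nn

class FixedSizeMemoryBank(nn.Module):
    def __init__(self, n_slots: int = 64, max_len: int = 2048):
        super().__init__()
        # Learned mixing weights over source positions (hypothetical).
        self.mix = nn.Parameter(torch.randn(n_slots, max_len) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, n_slots, d_model)
        seq_len = x.size(1)
        weights = torch.softmax(self.mix[:, :seq_len], dim=-1)  # (n_slots, seq_len)
        return weights @ x  # broadcasts over the batch dimension
```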
- Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU [19.103130032967663]
Incremental processing allows interactive systems to respond based on partial inputs.
Recent work attempts to apply Transformers incrementally via restart-incrementality.
This approach is computationally costly and does not scale efficiently for long sequences.
arXiv Detail & Related papers (2021-09-15T15:20:29Z)
- Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity.
arXiv Detail & Related papers (2021-08-20T09:44:44Z)
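To make the "global context via additive attention" idea more concrete, here is a heavily simplified PyTorch sketch: each token receives one scalar attention score, the sequence is pooled into a single global query vector, and that vector is broadcast back to every token, all in linear time. Fastformer itself has additional steps, so treat this as an illustration of the idea rather than the paper's exact model.
```python
# Simplified additive-attention-style global context (linear in seq_len).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveGlobalContext(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.score = nn.Linear(d_model, 1)   # one scalar score per token
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k = self.query(x), self.key(x)
        alpha = F.softmax(self.score(q), dim=1)          # (batch, seq_len, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)  # (batch, 1, d_model)
        return self.out(k * global_q) + x                # broadcast + residual
```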
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
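The "linear-complexity recurrent variant" referred to above is linear attention, which can be run as an RNN at generation time by carrying a small fixed-size state instead of attending over all past tokens. The generic step below (feature map, shapes, and epsilon are assumptions) shows that recurrence; it is not this paper's specific conversion procedure.
```python
# One decoding step of generic linear attention run recurrently: the state
# summarizes all past keys/values in O(d_k * d_v) memory, independent of length.
import torch
import torch.nn.functional as F

def linear_attention_step(q, k, v, state, norm):
    # state: (d_k, d_v) running sum of phi(k_i) v_i^T;  norm: (d_k,) sum of phi(k_i)
    phi = lambda t: F.elu(t) + 1.0            # positive feature map (assumption)
    state = state + torch.outer(phi(k), v)
    norm = norm + phi(k)
    out = (phi(q) @ state) / (phi(q) @ norm + 1e-6)
    return out, state, norm

# Usage: autoregressive generation with constant per-step memory.
d = 8
state, norm = torch.zeros(d, d), torch.zeros(d)
for _ in range(5):
    q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
    out, state, norm = linear_attention_step(q, k, v, state, norm)
```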
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
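A minimal way to picture the compression is a transformer block followed by pooling along the sequence dimension, so each stage runs on fewer hidden states than the last. The sketch below uses a stock PyTorch encoder layer and average pooling; the real Funnel-Transformer's pooling and decoder details differ.
```python
# One "funnel" stage: process, then shorten the sequence by pooling.
import torch
import torch.nn as nn

class FunnelStage(nn.Module):
    def __init__(self, d_model: int, n_heads: int, pool_stride: int = 2):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model) -> (batch, seq_len // pool_stride, d_model)
        h = self.block(h)
        return self.pool(h.transpose(1, 2)).transpose(1, 2)
```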
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
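The cascade idea itself is model-agnostic: rank candidates with progressively more expensive scorers and prune the pool between stages, so the costliest model only sees the few survivors. The sketch below is a generic illustration with placeholder scorers, not the paper's transformer-specific cascade.
```python
# Generic ranking cascade: cheap scorers prune the pool before expensive ones run.
from typing import Callable, List, Sequence, Tuple

def cascade_rank(candidates: Sequence[str],
                 scorers: List[Callable[[str], float]],
                 keep_fraction: float = 0.5) -> List[Tuple[str, float]]:
    pool = list(candidates)
    scored: List[Tuple[str, float]] = []
    for i, scorer in enumerate(scorers):
        scored = sorted(((c, scorer(c)) for c in pool), key=lambda cs: cs[1], reverse=True)
        if i < len(scorers) - 1:                      # prune before the next stage
            pool = [c for c, _ in scored[:max(1, int(len(scored) * keep_fraction))]]
    return scored
```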
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.