Fastformer: Additive Attention Can Be All You Need
- URL: http://arxiv.org/abs/2108.09084v2
- Date: Mon, 23 Aug 2021 13:11:51 GMT
- Title: Fastformer: Additive Attention Can Be All You Need
- Authors: Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang
- Abstract summary: We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity.
- Score: 51.79399904527525
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Transformer is a powerful model for text understanding. However, it is
inefficient due to its quadratic complexity with respect to the input sequence
length. Although there are many methods for Transformer acceleration, they are
still either inefficient on long sequences or not effective enough. In this
paper, we propose Fastformer, which is an efficient Transformer model based on
additive attention. In Fastformer, instead of modeling the pair-wise
interactions between tokens, we first use an additive attention mechanism to
model global contexts, and then further transform each token representation
based on its interaction with the global context representations. In this way,
Fastformer can achieve effective context modeling with linear complexity.
Extensive experiments on five datasets show that Fastformer is much more
efficient than many existing Transformer models while achieving comparable or
even better long-text modeling performance.
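To make the flow above concrete, here is a minimal single-head NumPy sketch of the additive-attention pipeline the abstract describes: queries are summarized into a global query, the global query conditions the keys to form a global key, and the global key then modulates the values. The scoring vectors, output transformation, and residual connection are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fastformer_attention(Q, K, V, w_q, w_k, W_out):
    """Single-head additive attention over a sequence of length N.

    Q, K, V: (N, d) query/key/value matrices.
    w_q, w_k: (d,) scoring vectors for the additive attention.
    W_out:   (d, d) output transformation.
    Every step below costs O(N * d), so the layer is linear in N.
    """
    d = Q.shape[-1]

    # 1) Summarize all queries into one global query vector.
    alpha = softmax(Q @ w_q / np.sqrt(d))          # (N,)
    global_q = alpha @ Q                           # (d,)

    # 2) Mix the global query into each key (element-wise product),
    #    then summarize into a global key vector.
    P = K * global_q                               # (N, d)
    beta = softmax(P @ w_k / np.sqrt(d))           # (N,)
    global_k = beta @ P                            # (d,)

    # 3) Mix the global key into each value, transform, and keep a
    #    residual connection to the queries for token-level information.
    U = V * global_k                               # (N, d)
    return U @ W_out + Q                           # (N, d)

# Toy usage with random weights.
rng = np.random.default_rng(0)
N, d = 16, 32
out = fastformer_attention(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                           rng.normal(size=(N, d)), rng.normal(size=d),
                           rng.normal(size=d), rng.normal(size=(d, d)))
print(out.shape)  # (16, 32)
```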
Related papers
- Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding [40.322453628755376]
General-purpose efficient Transformers are difficult to adapt directly to document modeling.
Fast-StrucTexT is an efficient multi-modal framework based on the StrucTexT algorithm with an hourglass transformer architecture.
Our model achieves state-of-the-art performance with almost 1.9x faster inference than previous state-of-the-art methods.
arXiv Detail & Related papers (2023-05-19T02:42:35Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
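The MemSizer summary above is terse, so the following is only a generic sketch of the key-value memory-bank idea it points to: compress the length-N source into a small, fixed number of memory slots and let each query attend to those slots, so the per-query cost no longer grows with N. The slot count and the learned pooling matrix are assumptions for illustration, not MemSizer's actual construction.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_bank_attention(Q, K, V, W_mem):
    """Attention against a compressed key-value memory.

    Q: (M, d) queries; K, V: (N, d) source keys/values.
    W_mem: (k, N) learned pooling that compresses the N source
    positions into k memory slots (k << N), so each query costs
    O(k * d) instead of O(N * d).
    """
    K_mem = W_mem @ K                              # (k, d) compressed keys
    V_mem = W_mem @ V                              # (k, d) compressed values
    scores = Q @ K_mem.T / np.sqrt(Q.shape[-1])    # (M, k)
    return softmax(scores) @ V_mem                 # (M, d)

# Toy usage: 1024 source positions compressed into 32 memory slots.
rng = np.random.default_rng(0)
M, N, k, d = 8, 1024, 32, 64
out = memory_bank_attention(rng.normal(size=(M, d)), rng.normal(size=(N, d)),
                            rng.normal(size=(N, d)), rng.normal(size=(k, N)) / N)
print(out.shape)  # (8, 64)
```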
- Transformer-F: A Transformer network with effective methods for learning universal sentence representation [8.225067988604351]
The Transformer model is widely used in natural language processing for sentence representation.
In this paper, two approaches are introduced to improve the performance of Transformers.
arXiv Detail & Related papers (2021-07-02T03:20:11Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
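To illustrate the Toeplitz observation above, the sketch below multiplies a Toeplitz matrix by a vector in O(n log n): the matrix is embedded in a circulant matrix, whose matrix-vector product is a circular convolution and can therefore be computed with the FFT. This is only the standard Toeplitz trick, not the paper's full kernelized-attention algorithm.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the Toeplitz matrix T (first column c, first row r,
    with c[0] == r[0]) by a vector x in O(n log n) using the FFT.

    T is embedded into a (2n-1)-sized circulant matrix, whose
    matrix-vector product is a circular convolution and is therefore
    diagonalized by the FFT.
    """
    n = len(c)
    v = np.concatenate([c, r[1:][::-1]])   # first column of the circulant
    y = np.concatenate([x, np.zeros(n - 1)])
    prod = np.fft.ifft(np.fft.fft(v) * np.fft.fft(y)).real
    return prod[:n]

# Check against the dense O(n^2) product on a random example.
rng = np.random.default_rng(0)
n = 128
c, r, x = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
r[0] = c[0]
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)]
              for i in range(n)])
assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))
```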
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
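The GroupBERT summary above only names the idea, so here is a schematic, not GroupBERT's actual layer layout, of decoupling local and global interactions: a depthwise convolution module is placed alongside the self-attention module. The kernel size, ordering, and the attention_fn placeholder are illustrative assumptions.

```python
import numpy as np

def depthwise_conv1d(X, kernels):
    """Depthwise 1-D convolution over the sequence axis.
    X: (N, d) token representations; kernels: (k, d) with odd k,
    'same' zero padding."""
    k, d = kernels.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    N = X.shape[0]
    out = np.zeros_like(X)
    for t in range(k):
        out += Xp[t:t + N] * kernels[t]
    return out

def local_global_block(X, attention_fn, kernels):
    """Schematic block: self-attention models global interactions,
    a depthwise convolution models local ones; both use residuals."""
    X = X + attention_fn(X)               # global mixing
    X = X + depthwise_conv1d(X, kernels)  # local mixing
    return X

# Toy usage with mean-pooling standing in for self-attention.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
out = local_global_block(X, lambda x: np.tile(x.mean(0), (x.shape[0], 1)),
                         rng.normal(size=(3, 8)) * 0.1)
print(out.shape)  # (16, 8)
```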
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
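As a concrete picture of FNet's token mixing, the sketch below applies a 2-D discrete Fourier transform over the sequence and hidden dimensions and keeps the real part, which is the parameter-free sublayer that replaces self-attention; the surrounding residual and normalization details are omitted here.

```python
import numpy as np

def fourier_mixing(X):
    """FNet-style token mixing: a 2-D discrete Fourier transform over
    the sequence and hidden dimensions, keeping only the real part.
    X: (N, d) token representations; no learned parameters are used."""
    return np.fft.fft2(X).real

# The mixed output is typically wrapped with a residual connection and
# layer normalization before the feed-forward sublayer.
X = np.random.default_rng(0).normal(size=(16, 8))
print(fourier_mixing(X).shape)  # (16, 8)
```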
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
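Since the summary above only states the idea, the following is a schematic of the feedback-memory loop rather than the paper's implementation: at each step, the outputs of all layers are merged into a single memory vector, and every layer, including the lowest, reads that shared memory at later steps. The layer callables and mixing weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feedback_transformer(tokens, layers, layer_weights):
    """Schematic feedback-memory decoding.

    tokens: list of (d,) input vectors processed one step at a time.
    layers: list of callables layer(x, memory) -> (d,), each reading the
            shared memory of merged past representations.
    layer_weights: (num_layers + 1,) mixing weights that merge a step's
            input and all layer outputs into one memory vector.

    Unlike a standard transformer, lower layers at step t can see
    representations produced by higher layers at steps < t, because the
    memory merges every layer's output before it is stored.
    """
    memory, outputs = [], []
    w = softmax(layer_weights)
    for x in tokens:
        states = [x]
        for layer in layers:
            states.append(layer(states[-1], memory))
        # Merge all layer states of this step into one memory entry.
        memory.append(sum(wi * s for wi, s in zip(w, states)))
        outputs.append(states[-1])
    return outputs

# Toy usage: a layer that mixes its input with the mean of the memory.
rng = np.random.default_rng(0)
toy_layer = lambda x, memory: np.tanh(x + (np.mean(memory, axis=0) if memory else 0.0))
outs = feedback_transformer([rng.normal(size=4) for _ in range(5)],
                            [toy_layer, toy_layer], np.zeros(3))
print(len(outs), outs[0].shape)  # 5 (4,)
```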