FNet: Mixing Tokens with Fourier Transforms
- URL: http://arxiv.org/abs/2105.03824v1
- Date: Sun, 9 May 2021 03:32:48 GMT
- Title: FNet: Mixing Tokens with Fourier Transforms
- Authors: James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon
- Abstract summary: We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
- Score: 0.578717214982749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that Transformer encoder architectures can be massively sped up, with
limited accuracy costs, by replacing the self-attention sublayers with simple
linear transformations that "mix" input tokens. These linear transformations,
along with simple nonlinearities in feed-forward layers, are sufficient to
model semantic relationships in several text classification tasks. Perhaps most
surprisingly, we find that replacing the self-attention sublayer in a
Transformer encoder with a standard, unparameterized Fourier Transform achieves
92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up
to seven times faster on GPUs and twice as fast on TPUs. The resulting model,
which we name FNet, scales very efficiently to long inputs, matching the
accuracy of the most accurate "efficient" Transformers on the Long Range Arena
benchmark, but training and running faster across all sequence lengths on GPUs
and relatively shorter sequence lengths on TPUs. Finally, FNet has a light
memory footprint and is particularly efficient at smaller model sizes: for a
fixed speed and accuracy budget, small FNet models outperform Transformer
counterparts.
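As a rough sketch of the mechanism described in the abstract, FNet's mixing sublayer applies a standard, unparameterized 2D discrete Fourier transform to the token embeddings (an FFT along the hidden dimension and one along the sequence dimension, keeping only the real part) in place of self-attention. The NumPy snippet below is a minimal illustration of such an encoder block, not the authors' implementation; the layer sizes and the helper names (`fourier_mix`, `encoder_block`) are assumptions made for the example.

```python
import numpy as np

def fourier_mix(x):
    """Token mixing via an unparameterized 2D DFT.

    x: (seq_len, d_model) array of token embeddings.
    Applies an FFT along the hidden dimension, then along the
    sequence dimension, and keeps only the real part.
    """
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=0).real

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_block(x, w1, b1, w2, b2):
    """One FNet-style encoder block: a Fourier mixing sublayer plus a
    position-wise feed-forward sublayer, each with a residual connection
    and layer normalization (sizes are illustrative)."""
    x = layer_norm(x + fourier_mix(x))            # mixing sublayer
    ff = np.maximum(x @ w1 + b1, 0.0) @ w2 + b2   # ReLU feed-forward
    return layer_norm(x + ff)

# Toy usage: 8 tokens, hidden size 16, feed-forward size 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
w1, b1 = rng.normal(size=(16, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.normal(size=(32, 16)) * 0.1, np.zeros(16)
print(encoder_block(x, w1, b1, w2, b2).shape)  # (8, 16)
```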
Related papers
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer significantly reduces computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z)
- PoNet: Pooling Network for Efficient Token Mixing in Long Sequences [34.657602765639375]
We propose a novel Pooling Network (PoNet) for token mixing in long sequences with linear complexity.
On the Long Range Arena benchmark, PoNet significantly outperforms Transformer and achieves competitive accuracy.
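The summary above only states that pooling replaces attention for token mixing; the snippet below is a generic illustration of pooling-based token mixing with linear complexity (a global average pool and fixed-size segment pools added back to each token). It is not PoNet's actual multi-granularity design; the `segment_size` parameter and the way pools are combined are assumptions made for the example.

```python
import numpy as np

def pooling_token_mix(x, segment_size=4):
    """Generic pooling-based token mixing, linear in sequence length.

    x: (seq_len, d_model). Each token is combined with a global
    average-pooled vector and the average pool of its local segment.
    Illustration of the general idea only, not PoNet's exact design.
    """
    seq_len, _ = x.shape
    global_pool = x.mean(axis=0, keepdims=True)        # (1, d_model)
    # Average-pool within fixed-size segments, then broadcast back.
    seg_ids = np.arange(seq_len) // segment_size
    seg_pool = np.zeros_like(x)
    for s in np.unique(seg_ids):
        mask = seg_ids == s
        seg_pool[mask] = x[mask].mean(axis=0)
    return x + global_pool + seg_pool

x = np.random.default_rng(0).normal(size=(12, 8))
print(pooling_token_mix(x).shape)  # (12, 8)
```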
arXiv Detail & Related papers (2021-10-06T01:07:54Z)
- Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
We propose Fastformer, which is an efficient Transformer model based on additive attention.
In Fastformer, instead of modeling the pairwise interactions between tokens, we first use an additive attention mechanism to model global contexts.
In this way, Fastformer can achieve effective context modeling with linear complexity.
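As an illustration of modeling global context with additive attention in linear time, the sketch below scores each token with a learned vector, forms a softmax-weighted global context vector, and combines it element-wise with every token. It is a simplified stand-in for the idea in the summary, not Fastformer's exact layer; the parameter names and the final projection are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_global_context(x, w_score, w_out):
    """Model global context with additive attention in linear time.

    x: (seq_len, d_model). A learned vector w_score assigns each token
    a scalar score; the softmax-weighted sum of tokens forms a single
    global context vector, which is combined element-wise with every
    token and projected. Simplified sketch, not the paper's full layer.
    """
    scores = softmax(x @ w_score / np.sqrt(x.shape[-1]))  # (seq_len,)
    global_ctx = scores @ x                               # (d_model,)
    return (x * global_ctx) @ w_out                       # (seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 16))
w_score = rng.normal(size=16)
w_out = rng.normal(size=(16, 16)) * 0.1
print(additive_global_context(x, w_score, w_out).shape)  # (10, 16)
```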
arXiv Detail & Related papers (2021-08-20T09:44:44Z)
- FNetAR: Mixing Tokens with Autoregressive Fourier Transforms [0.0]
We show that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling.
The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
arXiv Detail & Related papers (2021-07-22T21:24:02Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
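The FFT trick referenced above can be illustrated directly: a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the matrix in a 2n-by-2n circulant matrix and applying the convolution theorem. The sketch below shows only this building block, with illustrative variable names, not the paper's full kernelized-attention algorithm.

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, v):
    """Multiply an n x n Toeplitz matrix by a vector in O(n log n).

    first_col: T[:, 0], first_row: T[0, :] (first_col[0] == first_row[0]).
    The Toeplitz matrix is embedded into a circulant matrix of size 2n,
    whose matvec is a circular convolution computed with the FFT.
    """
    n = len(v)
    # First column of the 2n x 2n circulant embedding.
    c = np.concatenate([first_col, [0.0], first_row[1:][::-1]])
    v_pad = np.concatenate([v, np.zeros(n)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(v_pad)).real
    return y[:n]

# Check against the dense computation on a small example.
rng = np.random.default_rng(0)
n = 6
col, row = rng.normal(size=n), rng.normal(size=n)
row[0] = col[0]
T = np.array([[col[i - j] if i >= j else row[j - i] for j in range(n)]
              for i in range(n)])
v = rng.normal(size=n)
print(np.allclose(T @ v, toeplitz_matvec_fft(col, row, v)))  # True
```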
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
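To make "efficient recurrent counterpart" concrete, the snippet below shows a generic causal linear-attention recurrence of the kind such conversions target: a feature map is applied to queries and keys, and a constant-size running state replaces attention over the whole prefix. This is a standard linear-attention formulation used purely for illustration (the feature map `phi` is an arbitrary choice), not the specific conversion procedure of the paper.

```python
import numpy as np

def phi(x):
    """Simple positive feature map (an illustrative choice)."""
    return np.maximum(x, 0.0) + 1e-6

def linear_attention_rnn(q, k, v):
    """Causal linear attention computed as a recurrence.

    q, k, v: (seq_len, d). Instead of attending over the whole prefix
    at every step, a running outer-product state S and normalizer z
    are updated per token, so generation costs O(1) per step.
    """
    d = v.shape[-1]
    S = np.zeros((q.shape[-1], d))   # running sum of outer(phi(k_t), v_t)
    z = np.zeros(q.shape[-1])        # running sum of phi(k_t)
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        S += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = phi(q[t]) @ S / (phi(q[t]) @ z + 1e-9)
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
print(linear_attention_rnn(q, k, v).shape)  # (5, 8)
```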
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, reduces overall training time.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)