Fourier Transformer: Fast Long Range Modeling by Removing Sequence
Redundancy with FFT Operator
- URL: http://arxiv.org/abs/2305.15099v1
- Date: Wed, 24 May 2023 12:33:06 GMT
- Title: Fourier Transformer: Fast Long Range Modeling by Removing Sequence
Redundancy with FFT Operator
- Authors: Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen
Leng, Zhouhan Lin
- Abstract summary: Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit weights from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights, our model outperforms the standard BART.
- Score: 24.690247474891958
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The transformer model is known to be computationally demanding, and
prohibitively costly for long sequences, as the self-attention module has
quadratic time and space complexity with respect to sequence length. Many
researchers have focused on designing new forms of self-attention or
introducing new parameters to overcome this limitation; however, a large
portion of these approaches prevents the model from inheriting weights from
large pretrained models. In this work, we address the transformer's
inefficiency from another perspective. We propose Fourier Transformer, a
simple yet effective approach that progressively removes redundancies in the
hidden sequence using the ready-made Fast Fourier Transform (FFT) operator to
perform the Discrete Cosine Transform (DCT). Fourier Transformer is able to
significantly reduce computational costs while retaining the ability to
inherit weights from various large pretrained models. Experiments show that
our model achieves state-of-the-art performance among all transformer-based
models on the long-range modeling benchmark LRA, with significant improvements
in both speed and space. For generative seq-to-seq tasks including
CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms
the standard BART and other efficient models. \footnote{Our code is publicly
available at \url{https://github.com/LUMIA-Group/FourierTransformer}}
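To make the core operation described in the abstract concrete, the sketch below downsamples a hidden sequence by running a type-II DCT along the sequence dimension with the ordinary FFT operator and keeping only the lowest-frequency coefficients. This is a minimal, illustrative reading of the abstract, not the authors' released implementation (see the linked repository for that); the names `dct_ii` and `spectral_downsample` and the `keep_ratio` parameter are assumptions, and details such as where in the network the truncation is applied are omitted.

```python
import torch

def dct_ii(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Type-II DCT along `dim`, computed with an ordinary FFT
    (Makhoul's even/odd reordering trick)."""
    x = x.movedim(dim, -1)
    n = x.shape[-1]
    # Reorder: even-indexed samples ascending, then odd-indexed samples descending.
    v = torch.cat([x[..., ::2], x[..., 1::2].flip(-1)], dim=-1)
    V = torch.fft.fft(v, dim=-1)
    k = torch.arange(n, device=x.device, dtype=x.dtype)
    phase = torch.polar(torch.ones_like(k), -torch.pi * k / (2 * n))  # e^{-i*pi*k/(2n)}
    return (2.0 * (V * phase).real).movedim(-1, dim)

def spectral_downsample(hidden: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Shorten a (batch, seq_len, d_model) hidden sequence by keeping only the
    lowest-frequency DCT coefficients along the sequence dimension."""
    seq_len = hidden.shape[1]
    keep = max(1, int(seq_len * keep_ratio))
    coeffs = dct_ii(hidden, dim=1)   # frequency-domain view of the sequence
    return coeffs[:, :keep, :]       # drop high-frequency (redundant) components

if __name__ == "__main__":
    h = torch.randn(2, 512, 768)     # e.g. BART-sized hidden states
    print(spectral_downsample(h, keep_ratio=0.25).shape)  # torch.Size([2, 128, 768])
```

Since subsequent layers operate on the shortened sequence, the quadratic cost of self-attention drops accordingly; how aggressively to truncate, and at which layers, is a design choice of the paper not captured in this sketch.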
Related papers
- Utilizing Image Transforms and Diffusion Models for Generative Modeling of Short and Long Time Series [7.201938834736084]
We propose a unified generative model for varying-length time series.
We employ invertible transforms such as the delay embedding and the short-time Fourier transform.
We show that our approach achieves consistently state-of-the-art results against strong baselines.
arXiv Detail & Related papers (2024-10-25T13:06:18Z) - PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z) - MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - Robust representations of oil wells' intervals via sparse attention
mechanism [2.604557228169423]
We introduce the class of efficient Transformers named Regularized Transformers (Reguformers).
The focus in our experiments is on oil&gas data, namely, well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z) - Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier
Layers [0.0]
Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks.
Recent works have focused on eliminating the attention mechanism's computational inefficiency and showed that transformer-based models can still reach competitive results without the attention layer.
A pioneering study proposed the FNet, which replaces the attention layer with the Fourier Transform (FT) in the transformer encoder architecture.
arXiv Detail & Related papers (2022-09-26T16:23:02Z) - FNetAR: Mixing Tokens with Autoregressive Fourier Transforms [0.0]
We show that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling.
The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
arXiv Detail & Related papers (2021-07-22T21:24:02Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional
Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT); a minimal sketch of this Toeplitz-via-FFT trick is given after this list.
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs.
arXiv Detail & Related papers (2021-05-09T03:32:48Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z) - Train Large, Then Compress: Rethinking Model Size for Efficient Training
and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
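The kernelized-attention entry above notes that relative positional encoding forms a Toeplitz matrix whose products can be computed with the FFT. As a hedged illustration of that general trick (not the cited paper's actual attention kernel), the sketch below multiplies an n x n Toeplitz matrix by a vector in O(n log n) by embedding it in a 2n x 2n circulant matrix, which the FFT diagonalizes; the name `toeplitz_matvec_fft` and the small self-check are assumptions made for the example.

```python
import torch

def toeplitz_matvec_fft(first_col: torch.Tensor, first_row: torch.Tensor,
                        v: torch.Tensor) -> torch.Tensor:
    """Multiply the n x n Toeplitz matrix defined by its first column/row with
    a vector v in O(n log n), via circulant embedding and the FFT."""
    n = v.shape[-1]
    # First column of a 2n x 2n circulant matrix whose top-left block is the Toeplitz matrix.
    c = torch.cat([first_col,
                   torch.zeros(1, dtype=v.dtype, device=v.device),
                   first_row[1:].flip(0)])
    # Circulant matvec = circular convolution = pointwise product in the Fourier domain.
    prod = torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(v, n=2 * n))
    return prod[:n].real

if __name__ == "__main__":
    n = 8
    col = torch.randn(n)
    row = torch.cat([col[:1], torch.randn(n - 1)])   # row[0] must equal col[0]
    idx = torch.arange(n)
    diff = idx[:, None] - idx[None, :]
    # Dense reference: T[i, j] = col[i-j] if i >= j else row[j-i].
    T = torch.where(diff >= 0, col[diff.clamp(min=0)], row[(-diff).clamp(min=0)])
    v = torch.randn(n)
    print(torch.allclose(T @ v, toeplitz_matvec_fft(col, row, v), atol=1e-4))  # True
```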
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.