FNetAR: Mixing Tokens with Autoregressive Fourier Transforms
- URL: http://arxiv.org/abs/2107.10932v1
- Date: Thu, 22 Jul 2021 21:24:02 GMT
- Title: FNetAR: Mixing Tokens with Autoregressive Fourier Transforms
- Authors: Tim Lou, Michael Park, Mohammad Ramezanali, Vincent Tang
- Abstract summary: We show that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling.
The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this note we examine the autoregressive generalization of the FNet algorithm, in which self-attention layers from the standard Transformer architecture are substituted with a trivial sparse-uniform sampling procedure based on Fourier transforms. Using the Wikitext-103 benchmark, we demonstrate that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling compared to a Transformer-XL baseline (24.2 ppl) with only half the number of self-attention layers, thus providing further evidence for the superfluity of deep neural networks with heavily compounded attention mechanisms. The autoregressive Fourier transform could likely be used for parameter reduction on most Transformer-based time-series prediction models.
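The substitution described in the abstract can be made concrete with a small sketch. The snippet below is not the authors' implementation: it assumes a simple causal variant of FNet-style mixing in which each position is mixed via a 2D DFT over its causal prefix and only the real part is kept; the paper's sparse-uniform sampling procedure is only approximated, and the helper name causal_fourier_mix is hypothetical.

```python
import torch

def causal_fourier_mix(x: torch.Tensor) -> torch.Tensor:
    """Illustrative causal Fourier token mixing (an assumption, not FNetAR's exact rule).

    x: (batch, seq_len, d_model). Each output position t is computed from a
    2D DFT over the causal prefix x[:, :t + 1, :] (positions and hidden dim),
    keeping the real part as in FNet, so no future token leaks into position t.
    """
    batch, seq_len, d_model = x.shape
    out = torch.empty_like(x)
    for t in range(seq_len):
        prefix = x[:, : t + 1, :]            # (batch, t + 1, d_model)
        mixed = torch.fft.fft2(prefix).real  # 2D DFT over prefix, keep real part
        out[:, t, :] = mixed[:, t, :]        # read off the newest position only
    return out


# Example: drop-in replacement for a causal self-attention sublayer.
tokens = torch.randn(2, 16, 64)              # (batch, seq_len, d_model)
print(causal_fourier_mix(tokens).shape)      # torch.Size([2, 16, 64])
```

The explicit prefix loop is for clarity only; an efficient implementation would batch or reuse partial transforms rather than recomputing a DFT per position.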
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z) - PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
Self-attention mechanism in Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z) - iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z) - Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performances among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART.
arXiv Detail & Related papers (2023-05-24T12:33:06Z) - Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1)
arXiv Detail & Related papers (2022-11-26T01:56:05Z) - FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming the state-of-the-art methods by margins of 3%, 4% and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z) - FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting [23.199388386249215]
We propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series.
We exploit the fact that most time series tend to have a sparse representation in well-known basis such as Fourier transform.
Besides being more effective, the proposed method, termed Frequency Enhanced Decomposed Transformer (FEDformer), is more efficient than the standard Transformer.
arXiv Detail & Related papers (2022-01-30T06:24:25Z) - New Approaches to Long Document Summarization: Fourier Transform Based Attention in a Transformer Model [0.0]
We extensively redesign the newly introduced method of token mixing using Fourier Transforms (FNET) to replace the computationally expensive self-attention mechanism.
We also carry out long document summarization using established methods that are capable of processing over 8000 tokens.
All modifications showed better performance on the summarization task than when using the original FNET encoder in a transformer architecture.
arXiv Detail & Related papers (2021-11-25T18:03:41Z) - TCCT: Tightly-Coupled Convolutional Transformer on Time Series Forecasting [6.393659160890665]
We propose the concept of the tightly-coupled convolutional Transformer (TCCT) and three TCCT architectures.
Our experiments on real-world datasets show that our TCCT architectures could greatly improve the performance of existing state-of-the-art Transformer models.
arXiv Detail & Related papers (2021-08-29T08:49:31Z) - FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs (a minimal sketch of this token-mixing sublayer appears after this list).
arXiv Detail & Related papers (2021-05-09T03:32:48Z) - Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
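As a point of comparison with the non-causal baseline, the FNet entry above describes replacing self-attention sublayers with parameter-free Fourier transforms that mix input tokens. Below is a minimal sketch of that sublayer, assuming the real part of a 2D DFT over the sequence and hidden dimensions is kept; the class name FNetMixing is a hypothetical label for illustration, not the reference code.

```python
import torch
import torch.nn as nn

class FNetMixing(nn.Module):
    """Parameter-free token mixing in the spirit of FNet: a 2D DFT over the
    sequence and hidden dimensions, keeping only the real part."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); fft2 acts over the last two dims.
        return torch.fft.fft2(x).real


# Example: the mixing layer stands in for the attention sublayer of an encoder block.
y = FNetMixing()(torch.randn(2, 128, 512))
print(y.shape)  # torch.Size([2, 128, 512])
```

Because the transform has no learned parameters, the contrast with the causal sketch after the abstract is only in which tokens each position is allowed to mix with.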