FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation
- URL: http://arxiv.org/abs/2503.07630v2
- Date: Tue, 15 Apr 2025 21:21:55 GMT
- Title: FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation
- Authors: Andrew Kiruluta, Eric Lundy, Andreas Lemos
- Abstract summary: We present a novel non-autoregressive Transformer (NAT) architecture that employs Fourier-based mixing in the decoder to generate output sequences in parallel.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present FourierNAT, a novel non-autoregressive Transformer (NAT) architecture that employs Fourier-based mixing in the decoder to generate output sequences in parallel. While traditional NAT approaches often face challenges with capturing global dependencies, our method leverages a discrete Fourier transform to mix token embeddings across the entire sequence dimension, coupled with learned frequency-domain gating. This allows the model to efficiently propagate context without explicit autoregressive steps. Empirically, FourierNAT achieves competitive results against leading NAT baselines on standard benchmarks like WMT machine translation and CNN/DailyMail summarization, providing significant speed advantages over autoregressive Transformers. We further demonstrate that learned frequency-domain parameters allow the model to adaptively focus on long-range or short-range dependencies, partially mitigating the well-known coherence gaps in one-pass NAT generation. Overall, FourierNAT highlights the potential of integrating spectral-domain operations to accelerate and improve parallel text generation. This approach can provide substantial computational and time savings in LLM inference tasks.
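The mixing mechanism described in the abstract (a discrete Fourier transform over the sequence dimension followed by learned frequency-domain gating) can be illustrated with a short sketch. The PyTorch snippet below is not the authors' implementation: the module name `FourierMixingLayer`, the real-valued gate shape, and the residual-plus-LayerNorm wrapper are assumptions about how such a decoder mixing layer might look.

```python
import torch
import torch.nn as nn


class FourierMixingLayer(nn.Module):
    """Minimal sketch: mix token embeddings along the sequence axis with a DFT,
    apply a learned frequency-domain gate, and map back to the token domain."""

    def __init__(self, max_seq_len: int, d_model: int):
        super().__init__()
        # Learned real-valued gate over frequency bins and embedding channels;
        # initialised to 1.0 so the layer starts as plain Fourier mixing.
        # (Gate parameterisation is an illustrative assumption.)
        self.gate = nn.Parameter(torch.ones(max_seq_len, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # DFT along the sequence dimension mixes every token with every other.
        freq = torch.fft.fft(x, dim=1)
        # The gate lets the layer emphasise low frequencies (long-range
        # structure) or high frequencies (local structure) per channel.
        freq = freq * self.gate[:seq_len]
        # Inverse DFT back to the token domain; keep the real part.
        mixed = torch.fft.ifft(freq, dim=1).real
        # Residual connection + normalisation, as in standard Transformer blocks.
        return self.norm(x + mixed)


if __name__ == "__main__":
    layer = FourierMixingLayer(max_seq_len=128, d_model=64)
    tokens = torch.randn(2, 50, 64)  # a batch of 2 sequences, 50 tokens each
    print(layer(tokens).shape)       # torch.Size([2, 50, 64])
```

Because the DFT mixes all positions in a single O(n log n) operation, every output token can attend to global context in one parallel pass, which is the property the abstract relies on for non-autoregressive decoding.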
Related papers
- Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention [0.0]
The Learnable Multi-Scale Wavelet Transformer (LMWT) is a novel architecture that replaces standard dot-product self-attention with a learnable Haar wavelet module.
We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework.
Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages.
arXiv Detail & Related papers (2025-04-08T22:16:54Z)
- Efficient Time Series Forecasting via Hyper-Complex Models and Frequency Aggregation [1.024113475677323]
Time series forecasting is a long-standing problem in statistics and machine learning.
We propose the Frequency Information Aggregation (FIA)-Net, which is based on a novel complex-valued architecture.
We evaluate the FIA-Net on various time-series benchmarks and show that the proposed methodologies outperform existing state-of-the-art methods in terms of both accuracy and efficiency.
arXiv Detail & Related papers (2025-02-27T11:03:37Z)
- The FFT Strikes Back: An Efficient Alternative to Self-Attention [0.0]
We introduce FFTNet, an adaptive spectral filtering framework for long sequences.
By transforming inputs into the frequency domain, FFTNet exploits Parseval's theorem to capture long-range dependencies efficiently.
Our main theoretical contributions are 1) an adaptive spectral filter, 2) a combination of local windowing with a global FFT branch, and 3) the introduction of rich nonlinearities in both the frequency and token domains.
arXiv Detail & Related papers (2025-02-25T17:43:43Z)
- Variable-size Symmetry-based Graph Fourier Transforms for image compression [65.7352685872625]
We propose a new family of Symmetry-based Graph Fourier Transforms (SBGFTs) of variable sizes within a coding framework.
Our proposed algorithm generates symmetric graphs on the grid by adding specific symmetrical connections between nodes.
Experiments show that SBGFTs outperform the primary transforms integrated in the explicit Multiple Transform Selection.
arXiv Detail & Related papers (2024-11-24T13:00:44Z)
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
Self-attention mechanism in Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- WFTNet: Exploiting Global and Local Periodicity in Long-term Time Series Forecasting [61.64303388738395]
We propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting.
Tests on various time series datasets show WFTNet consistently outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-20T13:44:18Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Directed Acyclic Transformer for Non-Autoregressive Machine Translation [93.31114105366461]
The Directed Acyclic Transformer (DA-Transformer) represents hidden states in a Directed Acyclic Graph (DAG).
DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average.
arXiv Detail & Related papers (2022-05-16T06:02:29Z)
- Beyond Self Attention: A Subquadratic Fourier Wavelet Transformer with Multi Modal Fusion [0.0]
We revisit the use of spectral techniques to replace the attention mechanism in Transformers.
We present a comprehensive and novel reformulation of this technique in next generation transformer models.
arXiv Detail & Related papers (2021-11-25T18:03:41Z)
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
- FNetAR: Mixing Tokens with Autoregressive Fourier Transforms [0.0]
We show that FNetAR retains state-of-the-art performance (25.8 ppl) on the task of causal language modeling.
The autoregressive Fourier transform could likely be used for parameter savings on most Transformer-based time-series prediction models.
arXiv Detail & Related papers (2021-07-22T21:24:02Z)