The FFT Strikes Back: An Efficient Alternative to Self-Attention
- URL: http://arxiv.org/abs/2502.18394v4
- Date: Thu, 06 Mar 2025 15:39:55 GMT
- Title: The FFT Strikes Back: An Efficient Alternative to Self-Attention
- Authors: Jacob Fein-Ashley,
- Abstract summary: We introduce FFTNet, an adaptive spectral filtering framework for long sequences. By transforming inputs into the frequency domain, FFTNet exploits Parseval's theorem to capture long-range dependencies efficiently. Our main contributions are 1) an adaptive spectral filter, 2) the combination of local windowing with a global FFT branch, and 3) the introduction of rich nonlinearities in both the frequency and token domains.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional self-attention mechanisms incur quadratic complexity, limiting their scalability on long sequences. We introduce \textbf{FFTNet}, an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to achieve global token mixing in $\mathcal{O}(n\log n)$ time. By transforming inputs into the frequency domain, FFTNet exploits the orthogonality and energy preservation guaranteed by Parseval's theorem to capture long-range dependencies efficiently. Our main theoretical contributions are 1) an adaptive spectral filter, 2) the combination of local windowing with a global FFT branch, and 3) the introduction of rich nonlinearities in both the frequency and token domains. Experiments on the Long Range Arena and ImageNet benchmarks validate our theoretical insights and demonstrate superior performance over fixed Fourier and standard attention models.
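The core operation the abstract describes (an FFT, a learned per-frequency reweighting, and an inverse FFT followed by a nonlinearity) can be sketched in a few lines. The block below is a minimal illustration of that idea, not the authors' released FFTNet code; the module name, the filter parameterization, and the choice of GELU are assumptions.

```python
# Minimal sketch of FFT-based global token mixing with a learned spectral
# filter, in the spirit of the abstract above. NOT the authors' FFTNet
# implementation; names and design details are assumptions.
import torch
import torch.nn as nn

class SpectralMixer(nn.Module):
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        n_freq = seq_len // 2 + 1  # rfft keeps only non-negative frequencies
        # Learned per-frequency, per-channel weights (the "adaptive spectral filter").
        self.filter = nn.Parameter(torch.ones(n_freq, dim))
        self.act = nn.GELU()  # nonlinearity applied back in the token domain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        xf = torch.fft.rfft(x, dim=1)             # global mixing in O(n log n)
        xf = xf * self.filter                     # reweight frequency components
        x_mixed = torch.fft.irfft(xf, n=x.shape[1], dim=1)
        return self.act(x_mixed)

x = torch.randn(2, 128, 64)
print(SpectralMixer(seq_len=128, dim=64)(x).shape)  # torch.Size([2, 128, 64])
```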
Related papers
- Block Circulant Adapter for Large Language Models [10.353352027807272]
Fine-tuning large language models (LLMs) is difficult due to their huge model size.
Recent Fourier domain-based methods show potential for reducing fine-tuning costs.
We propose a block circulant matrix-based fine-tuning method with stable training that leverages the properties of circulant matrices.
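The efficiency argument behind circulant structure is that a circulant matrix-vector product is a circular convolution and can therefore be computed with FFTs in O(n log n) rather than O(n^2). A small NumPy check of that identity (a generic illustration, not the paper's adapter code):

```python
# A circulant matrix-vector product equals a circular convolution, so it can
# be evaluated with FFTs. Generic illustration, not the paper's adapter code.
import numpy as np

n = 8
c = np.random.randn(n)                                    # first column of the circulant matrix
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)   # full n x n circulant matrix
x = np.random.randn(n)

dense = C @ x                                              # O(n^2) dense product
via_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real  # O(n log n) via FFT

print(np.allclose(dense, via_fft))  # True
```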
arXiv Detail & Related papers (2025-05-01T15:14:32Z)
- FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach to enhance flatness of weights and activations.
Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective.
For inference latency, FlatQuant reduces the slowdown induced by the pre-quantization transformation from 0.26x for QuaRot to merely 0.07x, bringing up to 2.3x speedup for prefill and 1.7x speedup for decoding.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
- Neural Fourier Modelling: A Highly Compact Approach to Time-Series Analysis [9.969451740838418]
We introduce Neural Fourier Modelling (NFM), a compact yet powerful solution for time-series analysis.
NFM is grounded in two key properties of the Fourier transform (FT): (i) the ability to model finite-length time series as functions in the Fourier domain, and (ii) the capacity for data manipulation within the Fourier domain.
NFM achieves state-of-the-art performance on a wide range of tasks, including challenging time-series scenarios with previously unseen sampling rates at test time.
arXiv Detail & Related papers (2024-10-07T02:39:55Z)
- DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
Diffusion Gated Linear Attention Transformers (DiG) is a simple, adoptable solution with minimal parameter overhead.
We offer two variants, i.e., a plain and a U-shaped architecture, showing superior efficiency and competitive effectiveness.
arXiv Detail & Related papers (2024-05-28T17:59:33Z)
- Parameter-Efficient Fine-Tuning with Discrete Fourier Transform [26.563344030824414]
Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models.
We introduce FourierFT, which treats $\Delta W$ as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients.
Our method shows comparable or better performance with fewer parameters than LoRA on various tasks.
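The mechanism summarized above (parameterize the update $\Delta W$ by a handful of learned spectral coefficients and recover the dense matrix with an inverse FFT) can be sketched as follows; the shapes, the random choice of spectral locations, and the 2D FFT convention are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of a FourierFT-style update: only a small set of spectral
# coefficients is trainable, and the dense Delta W is recovered by an
# inverse 2D FFT. Illustrative assumptions, not the paper's exact code.
import torch

d_out, d_in, n_coeff = 64, 64, 128

# Randomly chosen (then frozen) spectral locations; only their values are trained.
idx = torch.stack([torch.randint(0, d_out, (n_coeff,)),
                   torch.randint(0, d_in, (n_coeff,))])
coeff = torch.nn.Parameter(torch.zeros(n_coeff))  # the only trainable parameters

def delta_w() -> torch.Tensor:
    spectrum = torch.zeros(d_out, d_in, dtype=torch.complex64)
    spectrum[idx[0], idx[1]] = coeff.to(torch.complex64)  # place sparse coefficients
    return torch.fft.ifft2(spectrum).real                 # dense d_out x d_in update

print(delta_w().shape)  # torch.Size([64, 64])
```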
arXiv Detail & Related papers (2024-05-05T17:15:24Z)
- ATFNet: Adaptive Time-Frequency Ensembled Network for Long-term Time Series Forecasting [7.694820760102176]
ATFNet is an innovative framework that combines a time domain module and a frequency domain module.
We introduce Dominant Harmonic Series Energy Weighting, a novel mechanism for adjusting the weights between the two modules.
Our Complex-valued Spectrum Attention mechanism offers a novel approach to discern the intricate relationships between different frequency combinations.
arXiv Detail & Related papers (2024-04-08T04:41:39Z)
- Tracking Meets LoRA: Faster Training, Larger Model, Stronger Performance [87.19164603145056]
We propose LoRAT, a method that unveils the power of large ViT models for tracking within laboratory-level resources.
The essence of our work lies in adapting LoRA, a technique that fine-tunes a small subset of model parameters without adding inference latency.
We design an anchor-free head to adapt PETR, enabling better performance with less computational overhead.
arXiv Detail & Related papers (2024-03-08T11:41:48Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, our method can reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores [18.016204763652553]
Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks.
Fast Fourier Transform (FFT) allows long convolutions to run in $O(N \log N)$ time in sequence length $N$ but has poor hardware utilization.
In this paper, we study how to optimize the FFT convolution.
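For reference, the $O(N \log N)$ identity that FFT convolution relies on (pad, transform, multiply pointwise, transform back) looks like the toy NumPy sketch below; the paper's contribution is making this computation efficient on tensor cores, which the sketch does not attempt.

```python
# Toy FFT long-convolution: linear (causal) convolution via zero-padded FFTs.
# Generic illustration of the O(N log N) identity, not FlashFFTConv itself.
import numpy as np

def fft_causal_conv(u: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Causal convolution of signal u with filter k via zero-padded FFTs."""
    n = len(u)
    L = 2 * n  # pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(u, L) * np.fft.rfft(k, L), L)
    return y[:n]

u, k = np.random.randn(1024), np.random.randn(1024)
direct = np.convolve(u, k)[:1024]          # O(N^2) reference
print(np.allclose(fft_causal_conv(u, k), direct))  # True
```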
arXiv Detail & Related papers (2023-11-10T07:33:35Z)
- p-Laplacian Transformer [7.2541371193810384]
$p$-Laplacian regularization, rooted in graph and image signal processing, introduces a parameter $p$ to control the regularization effect on these data.
We first show that the self-attention mechanism obtains the minimal Laplacian regularization.
We then propose a novel class of transformers, namely the $p$-Laplacian Transformer (p-LaT)
arXiv Detail & Related papers (2023-11-06T16:25:56Z)
- WFTNet: Exploiting Global and Local Periodicity in Long-term Time Series Forecasting [61.64303388738395]
We propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting.
Tests on various time series datasets show WFTNet consistently outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-20T13:44:18Z)
- Adaptive Frequency Filters As Efficient Global Token Mixers [100.27957692579892]
We show that adaptive frequency filters can serve as efficient global token mixers.
We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet.
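A simplified stand-in for this kind of frequency-domain token mixing is an elementwise learned filter applied after a 2D FFT over the spatial axes; note that the actual AFF operator predicts its mask adaptively from the content, whereas the mask below is a static learned parameter, so treat this purely as a sketch.

```python
# Simplified frequency-domain token mixer for image feature maps. The real AFF
# operator generates its mask adaptively from content; here the mask is a
# plain learned parameter, so this is only a sketch of the general idea.
import torch
import torch.nn as nn

class FrequencyFilter2d(nn.Module):
    def __init__(self, h: int, w: int, dim: int):
        super().__init__()
        # rfft2 keeps roughly half of the last spatial axis
        self.mask = nn.Parameter(torch.ones(dim, h, w // 2 + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, h, w)
        xf = torch.fft.rfft2(x)          # global spatial mixing in the frequency domain
        xf = xf * self.mask              # elementwise filtering
        return torch.fft.irfft2(xf, s=x.shape[-2:])

x = torch.randn(2, 32, 16, 16)
print(FrequencyFilter2d(16, 16, 32)(x).shape)  # torch.Size([2, 32, 16, 16])
```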
arXiv Detail & Related papers (2023-07-26T07:42:28Z)
- Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models [96.76758318732308]
We show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation.
Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the WFLW facial landmark dataset.
arXiv Detail & Related papers (2023-04-02T19:08:02Z)
- Dynamic Temporal Filtering in Video Models [128.02725199486719]
We present a new recipe of temporal feature learning, namely Dynamic Temporal Filter (DTF)
DTF learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics.
It is feasible to plug the DTF block into ConvNets and Transformers, yielding DTF-Net and DTF-Transformer.
arXiv Detail & Related papers (2022-11-15T15:59:28Z)
- Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using $\widetilde{O}(\cdot)$ samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z)
- Long-term Leap Attention, Short-term Periodic Shift for Video Classification [41.87505528859225]
Video transformer naturally incurs a heavier computation burden than a static vision transformer.
We propose LAPS, a long-term "Leap Attention" (LAN) and short-term "Periodic Shift" (P-Shift) module for video transformers.
arXiv Detail & Related papers (2022-07-12T13:30:15Z)
- Deep Frequency Filtering for Domain Generalization [55.66498461438285]
Deep Neural Networks (DNNs) have preferences for some frequency components in the learning process.
We propose Deep Frequency Filtering (DFF) for learning domain-generalizable features.
We show that applying our proposed DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks.
arXiv Detail & Related papers (2022-03-23T05:19:06Z)
- Functional Regularization for Reinforcement Learning via Learned Fourier Features [98.90474131452588]
We propose a simple architecture for deep reinforcement learning by embedding inputs into a learned Fourier basis.
We show that it improves the sample efficiency of both state-based and image-based RL.
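A minimal version of such a learned Fourier-feature embedding is a trainable frequency projection followed by sine and cosine features; the layer below is a sketch under that reading, and the dimensions and downstream use are assumptions.

```python
# Sketch of a learned Fourier-feature input embedding: project the state
# through trainable frequencies and take sin/cos features. Sizes and the
# downstream network are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class LearnedFourierFeatures(nn.Module):
    def __init__(self, in_dim: int, n_features: int):
        super().__init__()
        self.B = nn.Linear(in_dim, n_features, bias=False)  # trainable frequencies

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        proj = 2 * torch.pi * self.B(x)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

state = torch.randn(4, 17)                 # e.g. a low-dimensional RL state batch
emb = LearnedFourierFeatures(17, 64)(state)
print(emb.shape)                           # torch.Size([4, 128])
```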
arXiv Detail & Related papers (2021-12-06T18:59:52Z)