WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability
- URL: http://arxiv.org/abs/2210.01989v3
- Date: Mon, 22 May 2023 22:42:47 GMT
- Title: WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability
- Authors: Yufan Zhuang, Zihan Wang, Fangbo Tao, Jingbo Shang
- Abstract summary: Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers.
We argue that the wavelet transform is a better choice because it captures both position and frequency information with linear time complexity.
We propose Wavelet Space Attention (WavSpA) that facilitates attention learning in a learnable wavelet coefficient space.
- Score: 31.791279777902957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer and its variants are fundamental neural architectures in deep
learning. Recent works show that learning attention in the Fourier space can
improve the long sequence learning capability of Transformers. We argue that
the wavelet transform is a better choice because it captures both position
and frequency information with linear time complexity. Therefore, in this
paper, we systematically study the synergy between wavelet transform and
Transformers. We propose Wavelet Space Attention (WavSpA), which facilitates
attention learning in a learnable wavelet coefficient space. WavSpA replaces the
attention in Transformers by (1) applying a forward wavelet transform to project
the input sequences onto multi-resolution bases, (2) conducting attention
learning in the wavelet coefficient space, and (3) reconstructing the
representation in the input space via a backward wavelet transform. Extensive
experiments on the Long Range Arena demonstrate that learning attention in the
wavelet space using either fixed or adaptive wavelets can consistently improve
Transformer's performance and also significantly outperform learning in Fourier
space. We further show our method can enhance Transformer's reasoning
extrapolation capability over distance on the LEGO chain-of-reasoning task.
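To make the three-step recipe above concrete, here is a minimal PyTorch sketch assuming a single-level, fixed Haar wavelet and standard multi-head attention; the paper itself uses multi-level decompositions and learnable (adaptive) wavelets, and the class name HaarWavSpAttention and all hyperparameters are ours, not the authors'.
```python
import torch
import torch.nn as nn


class HaarWavSpAttention(nn.Module):
    """Illustrative wavelet-space attention: (1) single-level Haar transform
    along the sequence axis, (2) multi-head attention over the wavelet
    coefficients, (3) inverse Haar transform back to the input space."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed even for simplicity.
        s2 = 2.0 ** 0.5
        approx = (x[:, 0::2] + x[:, 1::2]) / s2      # coarse (low-frequency) coefficients
        detail = (x[:, 0::2] - x[:, 1::2]) / s2      # detail (high-frequency) coefficients
        coeffs = torch.cat([approx, detail], dim=1)  # attention runs in coefficient space
        coeffs, _ = self.attn(coeffs, coeffs, coeffs)
        a, d = coeffs.chunk(2, dim=1)
        even, odd = (a + d) / s2, (a - d) / s2       # inverse Haar reconstruction
        return torch.stack([even, odd], dim=2).reshape(x.shape)


# Toy usage: 128-token sequences of 64-dimensional embeddings, 4 heads.
block = HaarWavSpAttention(d_model=64, n_heads=4)
out = block(torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 128, 64])
```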
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
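The summary above does not spell out the mechanism; to the best of our understanding, Diff Transformer computes attention as the difference of two softmax attention maps so that noise common to both maps cancels. The single-head sketch below follows that reading only; the projection names and the fixed lambda initialization are illustrative rather than the paper's exact parameterization.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttentionSketch(nn.Module):
    """Single-head sketch of differential attention: two softmax maps are built
    from two query/key projections, and their lambda-weighted difference attends
    over the values, cancelling attention noise shared by both maps."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q1 = nn.Linear(d_model, d_model)
        self.k1 = nn.Linear(d_model, d_model)
        self.q2 = nn.Linear(d_model, d_model)
        self.k2 = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable reweighting scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scale = x.size(-1) ** -0.5
        a1 = F.softmax(self.q1(x) @ self.k1(x).transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(self.q2(x) @ self.k2(x).transpose(-2, -1) * scale, dim=-1)
        return (a1 - self.lam * a2) @ self.v(x)
```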
- Steerable Transformers [5.564976582065106]
We introduce Steerable Transformers, an extension of the Vision Transformer mechanism.
We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions.
arXiv Detail & Related papers (2024-05-24T20:43:19Z)
- Spiking Wavelet Transformer [1.8712213089437697]
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning.
Transformers with SNNs have shown promise for accuracy, but struggle to learn high-frequency patterns.
We propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner.
arXiv Detail & Related papers (2024-03-17T08:41:48Z)
- Multi-Scale Wavelet Transformer for Face Forgery Detection [43.33712402517951]
We propose a multi-scale wavelet transformer framework for face forgery detection.
Frequency-based spatial attention is designed to guide the spatial feature extractor to concentrate more on forgery traces.
Cross-modality attention is proposed to fuse the frequency features with the spatial features.
arXiv Detail & Related papers (2022-10-08T03:39:36Z)
- Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [138.29273453811945]
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks.
We propose a new Wavelet Vision Transformer (Wave-ViT) that formulates the invertible down-sampling with wavelet transforms and self-attention learning.
arXiv Detail & Related papers (2022-07-11T16:03:51Z)
- SepTr: Separable Transformer for Audio Spectrogram Processing [74.41172054754928]
We propose a new vision transformer architecture called Separable Transformer (SepTr).
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
arXiv Detail & Related papers (2022-03-17T19:48:43Z)
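As a rough sketch of the two sequential attention stages described in the SepTr summary above, the module below attends over a (time x frequency) grid of spectrogram tokens, first among tokens sharing a frequency bin and then among tokens sharing a time interval; class tokens and other details of SepTr are omitted, and all names are ours.
```python
import torch
import torch.nn as nn


class SeparableSpectrogramAttention(nn.Module):
    """SepTr-style separable attention over spectrogram tokens: one attention
    pass along the time axis (within each frequency bin), then one pass along
    the frequency axis (within each time interval)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, freq, d_model) spectrogram tokens.
        b, t, f, d = x.shape
        # First stage: attend along time within each frequency bin.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, f, t, d).permute(0, 2, 1, 3)
        # Second stage: attend along frequency within each time interval.
        xf = x.reshape(b * t, f, d)
        xf, _ = self.freq_attn(xf, xf, xf)
        return xf.reshape(b, t, f, d)
```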
- aiWave: Volumetric Image Compression with 3-D Trained Affine Wavelet-like Transform [43.984890290691695]
Most commonly used volumetric image compression methods are based on wavelet transform, such as JP3D.
In this paper, we first design a 3-D trained wavelet-like transform to enable signal-dependent and non-separable transform.
Then, an affine wavelet basis is introduced to capture the various local correlations in different regions of volumetric images.
arXiv Detail & Related papers (2022-03-11T10:02:01Z)
- Continual Transformers: Redundancy-Free Attention for Online Inference [86.3361797111839]
We propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference in a continual input stream.
Our modification is purely to the order of computations, while the produced outputs and learned weights are identical to those of the original Multi-Head Attention.
arXiv Detail & Related papers (2022-01-17T08:20:09Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
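The key identity behind the speed-up described above is that a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the Toeplitz matrix into a circulant one, which the FFT diagonalizes. The snippet below verifies just that identity for a relative-position bias matrix T[i, j] = b[i - j]; it is not the paper's full kernelized-attention algorithm, and the function name is ours.
```python
import torch


def toeplitz_matvec_fft(first_col: torch.Tensor, first_row: torch.Tensor,
                        v: torch.Tensor) -> torch.Tensor:
    """Multiply a Toeplitz matrix T (given by its first column and first row,
    as arises from relative positional biases T[i, j] = b[i - j]) with a vector
    v in O(n log n): embed T in a circulant matrix, whose matvec is a circular
    convolution and hence a pointwise product in the Fourier domain."""
    n = v.numel()
    # Circulant embedding of size 2n whose first column generates T.
    c = torch.cat([first_col, torch.zeros(1), first_row.flip(0)[:-1]])
    v_pad = torch.cat([v, torch.zeros(n)])
    prod = torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(v_pad))
    return prod[:n].real


# Check against the dense Toeplitz product for biases b[i - j].
n = 8
bias = torch.randn(2 * n - 1)  # b[-(n-1)], ..., b[n-1]
T = torch.stack([bias[n - 1 + i - torch.arange(n)] for i in range(n)])
v = torch.randn(n)
fast = toeplitz_matvec_fft(first_col=T[:, 0], first_row=T[0, :], v=v)
print(torch.allclose(T @ v, fast, atol=1e-4))  # True
```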
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
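The Scalable Transformers summary above leaves the sharing scheme implicit; one simple way to realize sub-models of different widths with shared parameters (an illustrative guess, not necessarily the paper's exact construction) is to let every smaller sub-layer reuse a prefix slice of the largest layer's weight matrix, as sketched below.
```python
import torch
import torch.nn as nn


class WidthSliceableLinear(nn.Module):
    """A linear layer whose smaller 'sub-layers' reuse a prefix of the full
    weight matrix, so sub-models of different scales share parameters."""

    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * max_in ** -0.5)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_dim: int) -> torch.Tensor:
        # Use only the first x.size(-1) input columns and out_dim output rows.
        w = self.weight[:out_dim, : x.size(-1)]
        return x @ w.t() + self.bias[:out_dim]


layer = WidthSliceableLinear(max_in=512, max_out=512)
full = layer(torch.randn(4, 512), out_dim=512)   # full-scale sub-model
small = layer(torch.randn(4, 256), out_dim=256)  # smaller sub-model, shared weights
print(full.shape, small.shape)                   # torch.Size([4, 512]) torch.Size([4, 256])
```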