WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability
- URL: http://arxiv.org/abs/2210.01989v3
- Date: Mon, 22 May 2023 22:42:47 GMT
- Title: WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability
- Authors: Yufan Zhuang, Zihan Wang, Fangbo Tao, Jingbo Shang
- Abstract summary: Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers.
We argue that the wavelet transform is a better choice because it captures both position and frequency information with linear time complexity.
We propose Wavelet Space Attention (WavSpA) that facilitates attention learning in a learnable wavelet coefficient space.
- Score: 31.791279777902957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer and its variants are fundamental neural architectures in deep
learning. Recent works show that learning attention in the Fourier space can
improve the long sequence learning capability of Transformers. We argue that
the wavelet transform is a better choice because it captures both position
and frequency information with linear time complexity. Therefore, in this
paper, we systematically study the synergy between wavelet transform and
Transformers. We propose Wavelet Space Attention (WavSpA), which facilitates
attention learning in a learnable wavelet coefficient space. WavSpA replaces the
attention in Transformers by (1) applying a forward wavelet transform to project
the input sequences onto multi-resolution bases, (2) conducting attention
learning in the wavelet coefficient space, and (3) reconstructing the
representation in the input space via a backward wavelet transform. Extensive
experiments on the Long Range Arena demonstrate that learning attention in the
wavelet space using either fixed or adaptive wavelets can consistently improve
Transformer's performance and also significantly outperform learning in Fourier
space. We further show our method can enhance Transformer's reasoning
extrapolation capability over distance on the LEGO chain-of-reasoning task.
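To make the three-step recipe above concrete, here is a minimal PyTorch sketch assuming a single-level, fixed Haar wavelet and standard multi-head attention; the paper itself uses multi-level decompositions and learnable (adaptive) wavelets, and the class name HaarWavSpAttention and all hyperparameters are ours, not the authors'.
```python
import torch
import torch.nn as nn


class HaarWavSpAttention(nn.Module):
    """Illustrative wavelet-space attention: (1) single-level Haar transform
    along the sequence axis, (2) multi-head attention over the wavelet
    coefficients, (3) inverse Haar transform back to the input space."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed even for simplicity.
        s2 = 2.0 ** 0.5
        approx = (x[:, 0::2] + x[:, 1::2]) / s2      # coarse (low-frequency) coefficients
        detail = (x[:, 0::2] - x[:, 1::2]) / s2      # detail (high-frequency) coefficients
        coeffs = torch.cat([approx, detail], dim=1)  # attention runs in coefficient space
        coeffs, _ = self.attn(coeffs, coeffs, coeffs)
        a, d = coeffs.chunk(2, dim=1)
        even, odd = (a + d) / s2, (a - d) / s2       # inverse Haar reconstruction
        return torch.stack([even, odd], dim=2).reshape(x.shape)


# Toy usage: 128-token sequences of 64-dimensional embeddings, 4 heads.
block = HaarWavSpAttention(d_model=64, n_heads=4)
out = block(torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 128, 64])
```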
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
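The summary above does not spell out the mechanism; to the best of our understanding, Diff Transformer computes attention as the difference of two softmax attention maps so that noise common to both maps cancels. The single-head sketch below follows that reading only; the projection names and the fixed lambda initialization are illustrative rather than the paper's exact parameterization.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttentionSketch(nn.Module):
    """Single-head sketch of differential attention: two softmax maps are built
    from two query/key projections, and their lambda-weighted difference attends
    over the values, cancelling attention noise shared by both maps."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q1 = nn.Linear(d_model, d_model)
        self.k1 = nn.Linear(d_model, d_model)
        self.q2 = nn.Linear(d_model, d_model)
        self.k2 = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable reweighting scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scale = x.size(-1) ** -0.5
        a1 = F.softmax(self.q1(x) @ self.k1(x).transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(self.q2(x) @ self.k2(x).transpose(-2, -1) * scale, dim=-1)
        return (a1 - self.lam * a2) @ self.v(x)
```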
- Steerable Transformers [5.564976582065106]
We introduce Steerable Transformers, an extension of the Vision Transformer mechanism.
We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions.
arXiv Detail & Related papers (2024-05-24T20:43:19Z)
- Spiking Wavelet Transformer [1.8712213089437697]
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning.
Transformers with SNNs have shown promise for accuracy, but struggle to learn high-frequency patterns.
We propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner.
arXiv Detail & Related papers (2024-03-17T08:41:48Z)
- Multi-Scale Wavelet Transformer for Face Forgery Detection [43.33712402517951]
We propose a multi-scale wavelet transformer framework for face forgery detection.
Frequency-based spatial attention is designed to guide the spatial feature extractor to concentrate more on forgery traces.
Cross-modality attention is proposed to fuse the frequency features with the spatial features.
arXiv Detail & Related papers (2022-10-08T03:39:36Z)
- Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning [138.29273453811945]
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks.
We propose a new Wavelet Vision Transformer (Wave-ViT) that formulates the invertible down-sampling with wavelet transforms and self-attention learning.
arXiv Detail & Related papers (2022-07-11T16:03:51Z)
- SepTr: Separable Transformer for Audio Spectrogram Processing [74.41172054754928]
We propose a new vision transformer architecture called Separable Transformer (SepTr).
SepTr employs two transformer blocks in a sequential manner, the first attending to tokens within the same frequency bin, and the second attending to tokens within the same time interval.
We conduct experiments on three benchmark data sets, showing that our architecture outperforms conventional vision transformers and other state-of-the-art methods.
arXiv Detail & Related papers (2022-03-17T19:48:43Z)
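As a rough sketch of the two sequential attention stages described in the SepTr summary above, the module below attends over a (time x frequency) grid of spectrogram tokens, first among tokens sharing a frequency bin and then among tokens sharing a time interval; class tokens and other details of SepTr are omitted, and all names are ours.
```python
import torch
import torch.nn as nn


class SeparableSpectrogramAttention(nn.Module):
    """SepTr-style separable attention over spectrogram tokens: one attention
    pass along the time axis (within each frequency bin), then one pass along
    the frequency axis (within each time interval)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, freq, d_model) spectrogram tokens.
        b, t, f, d = x.shape
        # First stage: attend along time within each frequency bin.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, f, t, d).permute(0, 2, 1, 3)
        # Second stage: attend along frequency within each time interval.
        xf = x.reshape(b * t, f, d)
        xf, _ = self.freq_attn(xf, xf, xf)
        return xf.reshape(b, t, f, d)
```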
- aiWave: Volumetric Image Compression with 3-D Trained Affine Wavelet-like Transform [43.984890290691695]
Most commonly used volumetric image compression methods are based on wavelet transform, such as JP3D.
In this paper, we first design a 3-D trained wavelet-like transform to enable signal-dependent and non-separable transform.
Then, an affine wavelet basis is introduced to capture the various local correlations in different regions of volumetric images.
arXiv Detail & Related papers (2022-03-11T10:02:01Z)
- Continual Transformers: Redundancy-Free Attention for Online Inference [86.3361797111839]
We propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference in a continual input stream.
Our modification is purely to the order of computations, while the produced outputs and learned weights are identical to those of the original Multi-Head Attention.
arXiv Detail & Related papers (2022-01-17T08:20:09Z)
- Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
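The key identity behind the speed-up described above is that a Toeplitz matrix-vector product can be computed in O(n log n) by embedding the Toeplitz matrix into a circulant one, which the FFT diagonalizes. The snippet below verifies just that identity for a relative-position bias matrix T[i, j] = b[i - j]; it is not the paper's full kernelized-attention algorithm, and the function name is ours.
```python
import torch


def toeplitz_matvec_fft(first_col: torch.Tensor, first_row: torch.Tensor,
                        v: torch.Tensor) -> torch.Tensor:
    """Multiply a Toeplitz matrix T (given by its first column and first row,
    as arises from relative positional biases T[i, j] = b[i - j]) with a vector
    v in O(n log n): embed T in a circulant matrix, whose matvec is a circular
    convolution and hence a pointwise product in the Fourier domain."""
    n = v.numel()
    # Circulant embedding of size 2n whose first column generates T.
    c = torch.cat([first_col, torch.zeros(1), first_row.flip(0)[:-1]])
    v_pad = torch.cat([v, torch.zeros(n)])
    prod = torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(v_pad))
    return prod[:n].real


# Check against the dense Toeplitz product for biases b[i - j].
n = 8
bias = torch.randn(2 * n - 1)  # b[-(n-1)], ..., b[n-1]
T = torch.stack([bias[n - 1 + i - torch.arange(n)] for i in range(n)])
v = torch.randn(n)
fast = toeplitz_matvec_fft(first_col=T[:, 0], first_row=T[0, :], v=v)
print(torch.allclose(T @ v, fast, atol=1e-4))  # True
```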
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
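The Scalable Transformers summary above leaves the sharing scheme implicit; one simple way to realize sub-models of different widths with shared parameters (an illustrative guess, not necessarily the paper's exact construction) is to let every smaller sub-layer reuse a prefix slice of the largest layer's weight matrix, as sketched below.
```python
import torch
import torch.nn as nn


class WidthSliceableLinear(nn.Module):
    """A linear layer whose smaller 'sub-layers' reuse a prefix of the full
    weight matrix, so sub-models of different scales share parameters."""

    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out, max_in) * max_in ** -0.5)
        self.bias = nn.Parameter(torch.zeros(max_out))

    def forward(self, x: torch.Tensor, out_dim: int) -> torch.Tensor:
        # Use only the first x.size(-1) input columns and out_dim output rows.
        w = self.weight[:out_dim, : x.size(-1)]
        return x @ w.t() + self.bias[:out_dim]


layer = WidthSliceableLinear(max_in=512, max_out=512)
full = layer(torch.randn(4, 512), out_dim=512)   # full-scale sub-model
small = layer(torch.randn(4, 256), out_dim=256)  # smaller sub-model, shared weights
print(full.shape, small.shape)                   # torch.Size([4, 512]) torch.Size([4, 256])
```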