Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning
- URL: http://arxiv.org/abs/2207.04978v1
- Date: Mon, 11 Jul 2022 16:03:51 GMT
- Title: Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning
- Authors: Ting Yao and Yingwei Pan and Yehao Li and Chong-Wah Ngo and Tao Mei
- Abstract summary: Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks.
We propose a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way.
- Score: 138.29273453811945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for
computer vision tasks, while the self-attention computation in Transformer
scales quadratically w.r.t. the input patch number. Thus, existing solutions
commonly employ down-sampling operations (e.g., average pooling) over
keys/values to dramatically reduce the computational cost. In this work, we
argue that such an over-aggressive down-sampling design is not invertible and
inevitably causes information loss, especially for high-frequency components
in objects (e.g., texture details). Motivated by the wavelet theory, we
construct a new Wavelet Vision Transformer (Wave-ViT) that formulates
the invertible down-sampling with wavelet transforms and self-attention
learning in a unified way. This proposal enables self-attention learning with
lossless down-sampling over keys/values, facilitating the pursuit of a better
efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are
leveraged to strengthen self-attention outputs by aggregating local contexts
with an enlarged receptive field. We validate the superiority of Wave-ViT through
extensive experiments over multiple vision tasks (e.g., image recognition,
object detection and instance segmentation). It surpasses
state-of-the-art ViT backbones with comparable FLOPs. Source code is available
at https://github.com/YehLi/ImageNetModel.
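To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of self-attention with wavelet-based down-sampling of keys/values: a level-1 Haar DWT gives an invertible 2x spatial reduction, and an inverse-DWT branch re-injects local context at full resolution. This is an illustration of the idea only, not the authors' implementation (see the linked repository for that); the module and helper names (WaveSelfAttention, dwt_haar, idwt_haar) and the concatenation-based fusion are our own assumptions.

```python
# Minimal sketch: keys/values are down-sampled with an invertible Haar DWT
# instead of average pooling, and an inverse DWT branch aggregates local
# context. Illustrative only; not the authors' exact Wave-ViT blocks.

import torch
import torch.nn as nn


def dwt_haar(x):
    """Level-1 orthonormal Haar DWT. x: (B, C, H, W) with even H and W."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh


def idwt_haar(ll, lh, hl, hh):
    """Exact inverse of dwt_haar; returns a (B, C, H, W) tensor."""
    B, C, h, w = ll.shape
    x = ll.new_zeros(B, C, 2 * h, 2 * w)
    x[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


class WaveSelfAttention(nn.Module):
    """Self-attention whose keys/values come from a lossless 2x down-sampling."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        # All four Haar sub-bands (4*dim channels) feed the key/value embedding,
        # so the 2x spatial reduction itself discards no information.
        self.kv_embed = nn.Linear(4 * dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Depth-wise conv on the sub-bands; its output is mapped back by IDWT.
        self.band_conv = nn.Conv2d(4 * dim, 4 * dim, 3, padding=1, groups=4 * dim)
        self.local = nn.Linear(dim, dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                                # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads, -1).transpose(1, 2)

        feat = x.transpose(1, 2).reshape(B, C, H, W)
        ll, lh, hl, hh = dwt_haar(feat)                  # each (B, C, H/2, W/2)
        bands = torch.cat([ll, lh, hl, hh], dim=1)       # (B, 4C, H/2, W/2)
        kv_in = self.kv_embed(bands.flatten(2).transpose(1, 2))   # (B, N/4, C)
        k, v = self.kv(kv_in).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)

        # Filtering the sub-bands at half resolution and applying the inverse
        # DWT aggregates local context with an enlarged receptive field
        # (a 3x3 kernel at half resolution covers roughly 6x6 at full size).
        ll2, lh2, hl2, hh2 = self.band_conv(bands).chunk(4, dim=1)
        local = self.local(idwt_haar(ll2, lh2, hl2, hh2).flatten(2).transpose(1, 2))
        return self.proj(torch.cat([out, local], dim=-1))


if __name__ == "__main__":
    attn = WaveSelfAttention(dim=64)
    tokens = torch.randn(2, 14 * 14, 64)                 # 14x14 patch grid
    print(attn(tokens, 14, 14).shape)                     # torch.Size([2, 196, 64])
```

Because the Haar transform is orthogonal, idwt_haar applied to the output of dwt_haar reproduces the input exactly; that reversibility is the "lossless down-sampling" property the abstract contrasts with average pooling over keys/values.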
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
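As a rough illustration of the "canceling noise" idea in the entry above, here is a minimal single-head PyTorch sketch of differential attention: two softmax attention maps are computed and their difference is used, so attention mass that both maps assign to irrelevant context cancels. The multi-head layout, per-head normalization, and lambda re-parameterization of the actual paper are omitted, and the names (DiffAttention, lam) are ours.

```python
# Minimal single-head sketch of differential attention: the difference of two
# softmax attention maps suppresses attention placed on irrelevant context.
# Illustrative only; not the paper's full multi-head formulation.

import torch
import torch.nn as nn


class DiffAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        # Two sets of query/key projections, one shared value projection.
        self.q = nn.Linear(dim, 2 * dim)
        self.k = nn.Linear(dim, 2 * dim)
        self.v = nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.tensor(0.5))   # weight of the "noise" map

    def forward(self, x):                            # x: (B, N, C)
        q1, q2 = self.q(x).chunk(2, dim=-1)
        k1, k2 = self.k(x).chunk(2, dim=-1)
        v = self.v(x)
        a1 = torch.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = torch.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        return (a1 - self.lam * a2) @ v              # differential attention


if __name__ == "__main__":
    print(DiffAttention(64)(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```

With lam set to zero the module reduces to ordinary softmax attention, which makes the contribution of the subtracted map easy to probe.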
- Wavelet-based Bi-dimensional Aggregation Network for SAR Image Change Detection [53.842568573251214]
Experimental results on three SAR datasets demonstrate that our WBANet significantly outperforms contemporary state-of-the-art methods.
Our WBANet achieves percentage of correct classification (PCC) scores of 98.33%, 96.65%, and 96.62% on the respective datasets.
arXiv Detail & Related papers (2024-07-18T04:36:10Z)
- Spiking Wavelet Transformer [1.8712213089437697]
Spiking neural networks (SNNs) offer an energy-efficient alternative to conventional deep learning.
Transformers with SNNs have shown promise for accuracy, but struggle to learn high-frequency patterns.
We propose the Spiking Wavelet Transformer (SWformer), an attention-free architecture that effectively learns comprehensive spatial-frequency features in a spike-driven manner.
arXiv Detail & Related papers (2024-03-17T08:41:48Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
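The following is a minimal PyTorch sketch of the data-dependent key/value sampling idea in the entry above: an offset network predicts where to sample the feature map, keys/values are gathered by bilinear interpolation at the shifted reference points, and standard attention follows. It is illustrative only; the multi-head structure, offset range handling, and relative position bias of the actual Deformable Attention Transformer are omitted, and all names here are ours.

```python
# Minimal sketch of deformable (data-dependent) key/value sampling followed by
# ordinary attention. Illustrative only; not the paper's exact module.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableSampling(nn.Module):
    def __init__(self, dim, n_points_side=7):
        super().__init__()
        self.n = n_points_side
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Predicts a 2D offset for every reference point from pooled features.
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x, H, W):                          # x: (B, N, C), N == H*W
        B, N, C = x.shape
        q = self.q(x)
        feat = x.transpose(1, 2).reshape(B, C, H, W)

        # Uniform reference grid of n x n points in normalized [-1, 1] coords.
        ys = torch.linspace(-1, 1, self.n, device=x.device)
        xs = torch.linspace(-1, 1, self.n, device=x.device)
        ref_y, ref_x = torch.meshgrid(ys, xs, indexing="ij")
        ref = torch.stack([ref_x, ref_y], dim=-1).expand(B, -1, -1, -1)

        # Offsets depend on the input (data-dependent sampling positions).
        pooled = F.adaptive_avg_pool2d(feat, (self.n, self.n))
        offsets = self.offset_net(pooled).permute(0, 2, 3, 1).tanh() * (2.0 / self.n)
        sampled = F.grid_sample(feat, ref + offsets, align_corners=True)
        sampled = sampled.flatten(2).transpose(1, 2)     # (B, n*n, C)

        k, v = self.kv(sampled).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                  # (B, N, C)


if __name__ == "__main__":
    m = DeformableSampling(64)
    print(m(torch.randn(2, 196, 64), 14, 14).shape)      # torch.Size([2, 196, 64])
```

In this sketch each query attends to only n_points_side**2 = 49 sampled keys/values rather than all 196 tokens, which is where the computational saving relative to full attention comes from.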
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic Inductive Bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
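A minimal PyTorch sketch of the parallel structure described in the ViTAE summary above: each layer runs a convolution block alongside multi-head self-attention and fuses the two branches before the feed-forward network. The layer sizes, the depth-wise plus point-wise convolution choice, and fusion by addition are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of a transformer layer with a convolution branch running in
# parallel to multi-head self-attention; the fused result feeds the FFN.
# Illustrative only; not the exact ViTAE block.

import torch
import torch.nn as nn


class ParallelConvAttnLayer(nn.Module):
    def __init__(self, dim, num_heads=8, H=14, W=14, mlp_ratio=4):
        super().__init__()
        self.H, self.W = H, W
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depth-wise + point-wise convolutions supply the local inductive bias.
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                      # x: (B, N, C), N == H * W
        B, N, C = x.shape
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)       # global branch
        conv_in = h.transpose(1, 2).reshape(B, C, self.H, self.W)
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)   # local branch
        x = x + attn_out + conv_out            # fuse the two branches
        return x + self.ffn(self.norm2(x))     # then the feed-forward network


if __name__ == "__main__":
    layer = ParallelConvAttnLayer(64)
    print(layer(torch.randn(2, 196, 64)).shape)   # torch.Size([2, 196, 64])
```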
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.