Scattering Vision Transformer: Spectral Mixing Matters
- URL: http://arxiv.org/abs/2311.01310v2
- Date: Mon, 20 Nov 2023 13:08:27 GMT
- Title: Scattering Vision Transformer: Spectral Mixing Matters
- Authors: Badri N. Patro and Vijay Srinivas Agneeswaran
- Abstract summary: We present a novel approach called Scattering Vision Transformer (SVT) to tackle attention complexity and the capture of fine-grained image details in vision transformers.
SVT incorporates a spectrally scattering network that enables the capture of intricate image details.
SVT achieves state-of-the-art performance on the ImageNet dataset with a significant reduction in the number of parameters and FLOPs.
- Score: 3.0665715162712837
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision transformers have gained significant attention and achieved
state-of-the-art performance in various computer vision tasks, including image
classification, instance segmentation, and object detection. However,
challenges remain in addressing attention complexity and effectively capturing
fine-grained information within images. Existing solutions often resort to
down-sampling operations, such as pooling, to reduce computational cost.
Unfortunately, such operations are non-invertible and can result in information
loss. In this paper, we present a novel approach called Scattering Vision
Transformer (SVT) to tackle these challenges. SVT incorporates a spectrally
scattering network that enables the capture of intricate image details. SVT
overcomes the invertibility issue associated with down-sampling operations by
separating low-frequency and high-frequency components. Furthermore, SVT
introduces a unique spectral gating network utilizing Einstein multiplication
for token and channel mixing, effectively reducing complexity. We show that SVT
achieves state-of-the-art performance on the ImageNet dataset with a
significant reduction in the number of parameters and FLOPs. SVT shows a 2%
improvement over LiTv2 and iFormer. SVT-H-S reaches 84.2% top-1 accuracy,
while SVT-H-B reaches 85.2% (state-of-the-art for base versions) and SVT-H-L
reaches 85.7% (again state-of-the-art for large versions). SVT also shows
comparable results in other vision tasks such as instance segmentation. SVT
also outperforms other transformers in transfer learning on standard datasets
such as CIFAR10, CIFAR100, Oxford Flowers, and Stanford Cars. The
project page is available at https://badripatro.github.io/svt/.
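As a rough illustration of the two mechanisms named in the abstract, the sketch below separates a token sequence into low- and high-frequency parts (an invertible alternative to pooling-based down-sampling) and applies a spectral gating step that mixes tokens and channels with Einstein multiplication (einsum). This is a minimal sketch assuming a PyTorch setting; the FFT-based split, the tensor shapes, and all names (frequency_split, SpectralGating, token_gate, channel_mix) are illustrative assumptions rather than SVT's actual operators.

```python
# Illustrative sketch only -- not the SVT implementation.
import torch
import torch.nn as nn


def frequency_split(x: torch.Tensor, cutoff: int):
    """Split (batch, tokens, dim) into low/high-frequency parts along the token axis.

    Unlike pooling, the split is invertible: low + high reconstructs x
    (up to floating-point error).
    """
    xf = torch.fft.rfft(x, dim=1)
    low_f, high_f = xf.clone(), xf.clone()
    low_f[:, cutoff:] = 0
    high_f[:, :cutoff] = 0
    n = x.shape[1]
    return torch.fft.irfft(low_f, n=n, dim=1), torch.fft.irfft(high_f, n=n, dim=1)


class SpectralGating(nn.Module):
    """Gate tokens in the frequency domain and mix channels with einsum."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        freq = num_tokens // 2 + 1  # rfft length over the token axis
        # Learnable complex-valued weights, stored as (real, imag) pairs.
        self.token_gate = nn.Parameter(torch.randn(freq, dim, 2) * 0.02)
        self.channel_mix = nn.Parameter(torch.randn(dim, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.shape[1]
        xf = torch.fft.rfft(x, dim=1)                       # token mixing happens in frequency space
        xf = xf * torch.view_as_complex(self.token_gate)    # element-wise spectral gating
        xf = torch.einsum("bnd,de->bne",                    # Einstein multiplication over channels
                          xf, torch.view_as_complex(self.channel_mix))
        return torch.fft.irfft(xf, n=n, dim=1)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)                             # (batch, tokens, dim)
    low, high = frequency_split(x, cutoff=16)
    print(torch.allclose(low + high, x, atol=1e-5))         # True: nothing is lost by the split
    print(SpectralGating(num_tokens=196, dim=64)(x).shape)  # torch.Size([2, 196, 64])
```

Because the low- and high-frequency parts sum back to the original tokens, no information is discarded by the separation, which is the property the abstract contrasts with non-invertible pooling.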
Related papers
- HaltingVT: Adaptive Token Halting Transformer for Efficient Video Recognition [11.362605513514943]
Action recognition in videos poses a challenge due to its high computational cost.
We propose HaltingVT, an efficient video transformer that adaptively removes redundant video patch tokens.
On the Mini-Kinetics dataset, we achieve 75.0% top-1 accuracy with 24.2 GFLOPs, as well as 67.2% top-1 accuracy with an extremely low 9.9 GFLOPs.
arXiv Detail & Related papers (2024-01-10T07:42:55Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior art.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Locality Guidance for Improving Vision Transformers on Tiny Datasets [17.352384588114838]
The Vision Transformer (VT) architecture is becoming popular in computer vision, but pure VT models perform poorly on tiny datasets.
This paper proposes locality guidance to improve the performance of VTs on tiny datasets.
arXiv Detail & Related papers (2022-07-20T16:41:41Z)
- Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve the computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% and also achieves 2.01x throughput.
arXiv Detail & Related papers (2022-03-08T02:57:49Z)
- Efficient Training of Visual Transformers with Small-Size Datasets [64.60765211331697]
Visual Transformers (VTs) are emerging as an alternative architectural paradigm to convolutional neural networks (CNNs).
We show that, despite comparable accuracy when trained on ImageNet, their performance on smaller datasets can differ significantly.
We propose a self-supervised task which can extract additional information from images with only a negligible computational overhead.
arXiv Detail & Related papers (2021-06-07T16:14:06Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings great benefits by scaling the dimensions of depth, width, resolution, and patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
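For the hierarchical pooling idea in the HVT entry above, the following is a minimal sketch assuming PyTorch; the max-pooling operator, kernel size, and number of stages are illustrative assumptions rather than HVT's actual design.

```python
# Illustrative sketch of progressive token pooling between transformer stages.
import torch
import torch.nn as nn


class TokenPooling(nn.Module):
    """Roughly halve the token sequence with a 1-D max pool over the token axis."""

    def __init__(self, kernel_size: int = 3, stride: int = 2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size, stride=stride, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -> pool over tokens, not channels
        return self.pool(x.transpose(1, 2)).transpose(1, 2)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)
    for _ in range(3):            # three pooling stages
        x = TokenPooling()(x)
        print(x.shape[1])         # 98, 49, 25: the sequence length shrinks stage by stage
```

Shorter token sequences in later stages are what reduce the cost of attention, which is the source of the savings the entry describes.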
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.