Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
- URL: http://arxiv.org/abs/2211.10526v5
- Date: Thu, 25 Jul 2024 17:29:22 GMT
- Title: Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
- Authors: Haoran You, Yunyang Xiong, Xiaoliang Dai, Bichen Wu, Peizhao Zhang, Haoqi Fan, Peter Vajda, Yingyan Celine Lin
- Abstract summary: Vision Transformers (ViTs) have shown impressive performance but still incur a higher computation cost than convolutional neural networks (CNNs).
Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer).
We propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but switches to linear-angular attention alone at inference.
- Score: 33.69340426607746
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost compared to convolutional neural networks (CNNs); one reason is that ViTs' attention measures global similarities and thus has quadratic complexity in the number of input tokens. Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer), which sacrifices ViTs' capability of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during ViT inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. We further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) two parameterized modules to approximate the high-order residuals: a depthwise convolution and an auxiliary masked softmax attention that help learn both global and local information, where the masks for the softmax attention are regularized to gradually become zeros and thus incur no overhead during ViT inference. Extensive experiments and ablation studies on three tasks consistently validate the effectiveness of the proposed Castling-ViT, e.g., achieving up to 1.8% higher accuracy or a 40% MACs reduction on ImageNet classification and 1.2 higher mAP on COCO detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attention.
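To make the linear-angular decomposition concrete: with q_hat and k_hat denoting L2-normalized queries and keys, an angular kernel of the form 1 - arccos(q_hat . k_hat)/pi expands as 1/2 + (q_hat . k_hat)/pi plus high-order residuals, and keeping only the linear part allows the usual Q(K^T V) reordering so the cost grows linearly with the number of tokens. The sketch below is a minimal, illustrative reading of that linear term only; the function name, tensor shapes, and row normalization are assumptions, and the paper's depthwise-convolution branch and auxiliary masked softmax attention (which approximate the residuals) are omitted.

```python
import math

import torch
import torch.nn.functional as F


def linear_angular_attention(q, k, v):
    """Linear term of an angular-kernel attention, computed in linear time (sketch).

    Angular similarity: 1 - arccos(q_hat . k_hat) / pi
                      = 1/2 + (q_hat . k_hat) / pi + high-order residuals.
    Dropping the residuals makes the kernel linear in q_hat, so the output is
    computed as Q (K^T V): O(N * d^2) instead of O(N^2 * d).
    q, k, v: tensors of shape (batch, heads, tokens, dim).
    """
    q_hat = F.normalize(q, dim=-1)  # unit-norm queries
    k_hat = F.normalize(k, dim=-1)  # unit-norm keys
    n = k.shape[2]

    # Numerator: sum_j (1/2 + q_hat_i . k_hat_j / pi) * v_j
    kv = torch.einsum("bhnd,bhne->bhde", k_hat, v)   # d x d key/value summary
    v_sum = v.sum(dim=2, keepdim=True)               # constant 1/2 term contributes sum_j v_j
    numer = 0.5 * v_sum + torch.einsum("bhnd,bhde->bhne", q_hat, kv) / math.pi

    # Denominator: sum_j (1/2 + q_hat_i . k_hat_j / pi); the n/2 term keeps it positive
    k_sum = k_hat.sum(dim=2)
    denom = 0.5 * n + torch.einsum("bhnd,bhd->bhn", q_hat, k_sum) / math.pi
    return numer / denom.unsqueeze(-1)
```

Because each cosine lies in [-1, 1], the n/2 term dominates the normalizer, so the division stays stable without the exponential feature maps that some other linear attentions rely on.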
Related papers
- The Linear Attention Resurrection in Vision Transformer [0.6798775532273751]
Vision Transformers (ViTs) have recently taken computer vision by storm.
Softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images.
We propose a linear attention method that addresses this limitation without sacrificing ViTs' core advantage of capturing global representations (a generic linear-attention sketch follows this entry).
arXiv Detail & Related papers (2025-01-27T16:29:17Z)
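As background for the quadratic-versus-linear contrast drawn in this entry (and in the Castling-ViT abstract above), here is an illustrative, generic comparison rather than code from any of the cited papers: softmax attention materializes an N x N score matrix, while a kernelized linear attention reorders the product as phi(Q)(phi(K)^T V) and never forms it; the elu(x)+1 feature map used here is an assumption, one common choice among several.

```python
import torch
import torch.nn.functional as F


def softmax_attention(q, k, v):
    # Vanilla attention: an N x N score matrix -> O(N^2 * d) time, O(N^2) memory.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v


def kernel_linear_attention(q, k, v, eps=1e-6):
    # Kernelized linear attention: with a positive feature map phi, associativity gives
    # (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V), so no N x N matrix is ever formed.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1          # assumed feature map
    kv = phi_k.transpose(-2, -1) @ v                   # d x d summary: O(N * d^2)
    denom = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (phi_q @ kv) / denom                        # row-normalized output
```

For 224x224 inputs with 16x16 patches (N = 196) the two are comparable, but as N grows with resolution the quadratic score matrix quickly dominates both compute and memory.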
- From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection [36.9781808268263]
Few-shot keypoint detection (FSKD) attempts to localize any keypoints, including novel or base keypoints, depending on the reference samples.
FSKD requires the semantically meaningful relations for keypoint similarity learning to overcome the ubiquitous noise and ambiguous local patterns.
We present a novel saliency-guided vision transformer, dubbed SalViT, for few-shot keypoint detection.
arXiv Detail & Related papers (2023-04-06T15:22:34Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention [23.874485033096917]
Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications.
We propose a first-of-its-kind algorithm-hardware co-designed framework, dubbed ViTALiTy, for boosting the inference efficiency of ViTs.
ViTALiTy unifies both low-rank and sparse components of the attention in ViTs.
arXiv Detail & Related papers (2022-11-09T18:58:21Z)
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers [43.48734363817069]
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs).
We present LightViT, a new family of light-weight ViTs that achieves a better accuracy-efficiency balance on pure transformer blocks without convolution.
Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2022-07-12T14:27:57Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks using ViTs and is the first to achieve higher performance than state-of-the-art CNN counterparts.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs of DeiT-B while simultaneously gaining an impressive 0.6% top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)