SAC-ViT: Semantic-Aware Clustering Vision Transformer with Early Exit
- URL: http://arxiv.org/abs/2503.00060v1
- Date: Thu, 27 Feb 2025 02:24:22 GMT
- Title: SAC-ViT: Semantic-Aware Clustering Vision Transformer with Early Exit
- Authors: Youbing Hu, Yun Cheng, Anqi Lu, Dawei Wei, Zhijun Li,
- Abstract summary: Vision Transformer (ViT) excels in global modeling but faces deployment challenges on resource-constrained devices.<n>We propose the Semantic-Aware Clustering Vision Transformer (SAC-ViT) to enhance ViT's computational efficiency.
- Score: 6.87425726793675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Vision Transformer (ViT) excels in global modeling but faces deployment challenges on resource-constrained devices due to the quadratic computational complexity of its attention mechanism. To address this, we propose the Semantic-Aware Clustering Vision Transformer (SAC-ViT), a non-iterative approach to enhance ViT's computational efficiency. SAC-ViT operates in two stages: Early Exit (EE) and Semantic-Aware Clustering (SAC). In the EE stage, downsampled input images are processed to extract global semantic information and generate initial inference results. If these results do not meet the EE termination criteria, the information is clustered into target and non-target tokens. In the SAC stage, target tokens are mapped back to the original image, cropped, and embedded. These target tokens are then combined with reused non-target tokens from the EE stage, and the attention mechanism is applied within each cluster. This two-stage design, with end-to-end optimization, reduces spatial redundancy and enhances computational efficiency, significantly boosting overall ViT performance. Extensive experiments demonstrate the efficacy of SAC-ViT, reducing 62% of the FLOPs of DeiT and achieving 1.98 times throughput without compromising performance.
Related papers
- ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages [0.0]
Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies.
We propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers.
arXiv Detail & Related papers (2025-04-21T03:00:17Z) - Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads [10.169639612525643]
We propose a new multi-head self-attention (MHSA) variant named Fibottention, which can replace MHSA in Transformer architectures.<n>Fibottention is data-efficient and computationally more suitable for processing large numbers of tokens than the standard MHSA.<n>It employs structured sparse attention based on dilated Fibonacci sequences, which, uniquely, differ across attention heads, resulting in-like diverse features across heads.
arXiv Detail & Related papers (2024-06-27T17:59:40Z) - Sharing Key Semantics in Transformer Makes Efficient Image Restoration [148.22790334216117]
Self-attention mechanism, a cornerstone of Vision Transformers (ViTs) tends to encompass all global cues.<n>Small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process.<n>We propose boosting IR's performance by sharing the key semantics via Transformer for IR (ie, SemanIR) in this paper.
arXiv Detail & Related papers (2024-05-30T12:45:34Z) - Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named textbfSemantic textbfEquitable textbfClustering (SEC)
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z) - Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression [63.23578860867408]
We investigate how to integrate the evaluations of importance and sparsity scores into a single stage.
We present OFB, a cost-efficient approach that simultaneously evaluates both importance and sparsity scores.
Experiments demonstrate that OFB can achieve superior compression performance over state-of-the-art searching-based and pruning-based methods.
arXiv Detail & Related papers (2024-03-23T13:22:36Z) - LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient
Image Recognition [9.727093171296678]
Vision Transformer (ViT) excels in accuracy when handling high-resolution images.
It confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements.
We present the Localization and Focus Vision Transformer (LF-ViT)
It operates by strategically curtailing computational demands without impinging on performance.
arXiv Detail & Related papers (2024-01-08T01:32:49Z) - TiC: Exploring Vision Transformer in Convolution [37.50285921899263]
We propose the Multi-Head Self-Attention Convolution (MSA-Conv)
MSA-Conv incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones.
We present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv.
arXiv Detail & Related papers (2023-10-06T10:16:26Z) - Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision computation transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z) - Iwin: Human-Object Interaction Detection via Transformer with Irregular
Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.