LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient
Image Recognition
- URL: http://arxiv.org/abs/2402.00033v1
- Date: Mon, 8 Jan 2024 01:32:49 GMT
- Title: LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient
Image Recognition
- Authors: Youbing Hu, Yun Cheng, Anqi Lu, Zhiqiang Cao, Dawei Wei, Jie Liu,
Zhijun Li
- Abstract summary: Vision Transformer (ViT) excels in accuracy when handling high-resolution images.
It confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements.
We present the Localization and Focus Vision Transformer (LF-ViT)
It operates by strategically curtailing computational demands without impinging on performance.
- Score: 9.727093171296678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Vision Transformer (ViT) excels in accuracy when handling high-resolution
images, yet it confronts the challenge of significant spatial redundancy,
leading to increased computational and memory requirements. To address this, we
present the Localization and Focus Vision Transformer (LF-ViT). This model
operates by strategically curtailing computational demands without impinging on
performance. In the Localization phase, a reduced-resolution image is
processed; if a definitive prediction remains elusive, our pioneering
Neighborhood Global Class Attention (NGCA) mechanism is triggered, effectively
identifying and spotlighting class-discriminative regions based on initial
findings. Subsequently, in the Focus phase, the designated region of the original
image is processed to refine recognition. Uniquely, LF-ViT employs consistent
parameters across both phases, ensuring seamless end-to-end optimization. Our
empirical tests affirm LF-ViT's prowess: it decreases DeiT-S's FLOPs
by 63% and concurrently doubles its throughput. Code for this project is
at https://github.com/edgeai1/LF-ViT.git.
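As a rough illustration of the two-phase inference the abstract describes, here is a minimal PyTorch-style sketch. It assumes a hypothetical `vit` callable that returns both logits and a class-attention map over the patch grid; the confidence threshold, neighborhood window, and resolutions are illustrative placeholders, not values from the released code.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def lf_vit_infer(vit, image, conf_thresh=0.7, low_res=112, focus_res=224):
    """Two-phase inference in the spirit of LF-ViT (illustrative sketch only).

    Assumes a batch of one image shaped (1, 3, H, W) with H, W >= focus_res,
    and that `vit(x)` returns (logits, cls_attn), where cls_attn is the class
    token's attention over the patch grid, shaped (grid_h, grid_w).
    """
    # Localization phase: classify a down-sampled copy of the image.
    small = F.interpolate(image, size=(low_res, low_res),
                          mode="bilinear", align_corners=False)
    logits, cls_attn = vit(small)
    conf, pred = logits.softmax(-1).max(-1)
    if conf.item() >= conf_thresh:               # definitive prediction: early exit
        return pred

    # NGCA-style localization (simplified): find the patch neighborhood with
    # the largest accumulated class attention.
    grid_h, grid_w = cls_attn.shape[-2:]
    win = 3                                      # neighborhood size (assumed)
    pooled = F.avg_pool2d(cls_attn.reshape(1, 1, grid_h, grid_w), win, stride=1)
    flat = pooled.flatten().argmax().item()
    cy = flat // pooled.shape[-1] + win // 2     # window center on the patch grid
    cx = flat % pooled.shape[-1] + win // 2

    # Map the grid location back to pixels and crop a focus window around it.
    H, W = image.shape[-2:]
    py, px = int((cy + 0.5) / grid_h * H), int((cx + 0.5) / grid_w * W)
    top = max(0, min(py - focus_res // 2, H - focus_res))
    left = max(0, min(px - focus_res // 2, W - focus_res))
    crop = image[..., top:top + focus_res, left:left + focus_res]

    # Focus phase: re-run the *same* ViT (shared parameters) on the crop.
    logits, _ = vit(crop)
    return logits.argmax(-1)
```
Because both phases call the same `vit`, the sketch mirrors the shared-parameter, end-to-end design stressed in the abstract; the early exit is what saves computation on images classified confidently at low resolution.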
Related papers
- LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth
Limited Optical Signal Acquisition [14.773452863027037]
We introduce a novel approach leveraging pre-acquisition modulation to reduce the acquisition volume.
Uniquely, LUM-ViT incorporates a learnable under-sampling mask tailored for pre-acquisition modulation.
Our evaluations reveal that, by sampling a mere 10% of the original image pixels, LUM-ViT maintains the accuracy loss within 1.8% on the ImageNet classification task.
arXiv Detail & Related papers (2024-03-03T06:49:01Z)
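To make the masking idea above concrete, here is a minimal sketch of a learnable under-sampling mask with a fixed measurement budget and a straight-through relaxation. The class name, the sigmoid relaxation, and the default 10% budget (matching the figure quoted above) are illustrative choices, not details of the authors' implementation.
```python
import torch
import torch.nn as nn

class LearnableUnderSamplingMask(nn.Module):
    """Illustrative learnable sampling mask in the spirit of LUM-ViT (not the
    authors' implementation). A learned score per position keeps a fixed
    measurement budget; a straight-through estimator keeps it trainable."""

    def __init__(self, num_positions: int, keep_ratio: float = 0.10):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_positions))       # learned importance
        self.k = max(1, int(num_positions * keep_ratio))             # e.g. a 10% budget

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_positions, dim), e.g. flattened measurement positions.
        topk = self.scores.topk(self.k).indices
        hard = torch.zeros_like(self.scores).scatter_(0, topk, 1.0)  # 0/1 mask
        soft = torch.sigmoid(self.scores)
        # Straight-through: binary mask on the forward pass, sigmoid gradient backward.
        mask = hard + soft - soft.detach()
        return x * mask.view(1, -1, 1)
```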
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
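A minimal sketch of the second stage described above, assuming the per-image clean-feature estimates from stage one are already available as regression targets; the module size and the training-step helper are assumptions, not taken from the DVT code.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDenoiser(nn.Module):
    """Lightweight stage-two denoiser in the spirit of DVT (illustrative only):
    a single transformer block maps raw ViT features to denoised features."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)

    def forward(self, raw_feats: torch.Tensor) -> torch.Tensor:
        # raw_feats: (batch, tokens, dim) from a frozen, artifact-prone ViT.
        return self.block(raw_feats)

def train_step(denoiser, optimizer, raw_feats, clean_targets):
    # clean_targets: per-image clean-feature estimates produced by stage one.
    loss = F.mse_loss(denoiser(raw_feats), clean_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```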
- ViT-Calibrator: Decision Stream Calibration for Vision Transformer [49.60474757318486]
We propose a new paradigm dubbed Decision Stream that boosts the performance of general Vision Transformers.
We shed light on the information propagation mechanism in the learning procedure by exploring the correlation between different tokens and the relevance coefficient of multiple dimensions.
arXiv Detail & Related papers (2023-04-10T02:40:24Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
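To picture the locality bias mentioned above, here is a toy attention sketch that penalizes attention between distant patches. It deliberately stays in plain quadratic form and does not reproduce the paper's linear-complexity formulation; the bias strength and distance measure are assumptions for illustration.
```python
import torch

def vicinity_biased_attention(q, k, v, grid_h, grid_w, gamma=0.1):
    """Toy attention with a 2D locality bias (illustrative only). It conveys
    the 'prefer nearby patches' idea in plain quadratic form and does NOT
    reproduce Vicinity Attention's linear-complexity formulation."""
    # q, k, v: (batch, tokens, dim) with tokens == grid_h * grid_w.
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()    # (tokens, 2)
    dist = torch.cdist(coords, coords)                                    # patch-to-patch distance
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 - gamma * dist  # penalize far patches
    return scores.softmax(dim=-1) @ v
```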
- GradViT: Gradient Inversion of Vision Transformers [83.54779732309653]
We demonstrate the vulnerability of vision transformers (ViTs) to gradient-based inversion attacks.
We introduce a method, named GradViT, that optimizes random noise into natural-looking images.
We observe unprecedentedly high fidelity and closeness to the original (hidden) data.
arXiv Detail & Related papers (2022-03-22T17:06:07Z)
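A generic gradient-matching inversion loop in the spirit of the entry above: random noise is optimized so that the gradients it induces match the observed (leaked) gradients, with a simple total-variation prior standing in for GradViT's image priors, which are not reproduced here. All hyperparameters are placeholders.
```python
import torch

def gradient_inversion(model, loss_fn, target_grads, labels, image_shape,
                       steps=2000, lr=0.1, tv_weight=1e-4):
    """Generic gradient-matching inversion loop (illustrative; GradViT adds
    ViT-specific losses and image priors that are not reproduced here)."""
    dummy = torch.randn(image_shape, requires_grad=True)      # start from random noise
    opt = torch.optim.Adam([dummy], lr=lr)
    params = [p for p in model.parameters() if p.requires_grad]
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(dummy), labels)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Match the gradients induced by the dummy image to the observed ones.
        match = sum(((g - t) ** 2).sum() for g, t in zip(grads, target_grads))
        # A simple total-variation prior nudges the result toward natural images.
        tv = (dummy[..., 1:, :] - dummy[..., :-1, :]).abs().mean() \
           + (dummy[..., :, 1:] - dummy[..., :, :-1]).abs().mean()
        (match + tv_weight * tv).backward()
        opt.step()
    return dummy.detach()
```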
- Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% and also achieves a 2.01x throughput gain.
arXiv Detail & Related papers (2022-03-08T02:57:49Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
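One simplified way to picture the token reduction described above is a small scoring head that forwards only the top-scoring patch tokens to the next block, as sketched below; the paper's actual adaptive mechanism is more involved, and the keep ratio and scoring head here are assumptions.
```python
import torch
import torch.nn as nn

class TokenReducer(nn.Module):
    """Simplified per-layer token pruning (illustrative, not AdaViT's exact
    mechanism): a tiny head scores every patch token and only the top-scoring
    fraction is forwarded to the next block; the class token is always kept."""

    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n, dim); index 0 is assumed to be the class token.
        cls_tok, patches = tokens[:, :1], tokens[:, 1:]
        scores = self.score(patches).squeeze(-1)                  # (batch, n-1)
        k = max(1, int(patches.shape[1] * self.keep_ratio))
        keep = scores.topk(k, dim=1).indices                      # kept token indices
        idx = keep.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        return torch.cat([cls_tok, patches.gather(1, idx)], dim=1)
```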
- Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to Transformer to address this challenge.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z)