Fast Vision Transformers with HiLo Attention
- URL: http://arxiv.org/abs/2205.13213v5
- Date: Wed, 19 Apr 2023 12:04:13 GMT
- Title: Fast Vision Transformers with HiLo Attention
- Authors: Zizheng Pan, Jianfei Cai, Bohan Zhuang
- Abstract summary: Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision.
We introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods.
Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation.
- Score: 40.8842135978138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have triggered the most recent and significant
breakthroughs in computer vision. Their efficient designs are mostly guided by
the indirect metric of computational complexity, i.e., FLOPs, which, however,
correlates only loosely with direct metrics such as throughput. Thus, we propose to use
the direct speed evaluation on the target platform as the design principle for
efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT
which performs favourably against the existing state-of-the-art methods across
a spectrum of different model sizes with faster speed. At the core of LITv2 is
a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the
insight that high frequencies in an image capture local fine details and low
frequencies focus on global structures, whereas a standard multi-head
self-attention layer ignores the distinct characteristics of different frequencies. Therefore, we
propose to disentangle the high/low frequency patterns in an attention layer by
separating the heads into two groups, where one group encodes high frequencies
via self-attention within each local window, and another group encodes low
frequencies by performing global attention between the average-pooled
low-frequency keys and values from each window and each query position in the
input feature map. Benefiting from the efficient design for both groups, we
show that HiLo is superior to the existing attention mechanisms by
comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and
CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and
1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves
as a strong backbone for mainstream vision tasks including image
classification, dense detection and segmentation. Code is available at
https://github.com/ziplab/LITv2.
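The abstract's description of HiLo is concrete enough to sketch in code. The following PyTorch snippet is a minimal reconstruction from the abstract alone, not the authors' implementation (see https://github.com/ziplab/LITv2 for that); the head-split ratio `alpha`, the default `window_size`, and all module names are assumptions made for illustration, and it assumes both head groups are non-empty.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiLoSketch(nn.Module):
    """Sketch of HiLo attention as described in the abstract.

    Heads are split into a high-frequency group (Hi-Fi: self-attention inside
    each local window) and a low-frequency group (Lo-Fi: every query attends
    to keys/values average-pooled once per window). Assumes 0 < alpha < 1.
    """

    def __init__(self, dim, num_heads=8, window_size=2, alpha=0.5):
        super().__init__()
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        self.ws = window_size

        self.l_heads = int(num_heads * alpha)       # Lo-Fi heads (global, pooled k/v)
        self.h_heads = num_heads - self.l_heads     # Hi-Fi heads (local windows)
        self.l_dim = self.l_heads * head_dim
        self.h_dim = self.h_heads * head_dim

        self.h_qkv = nn.Linear(dim, self.h_dim * 3)
        self.h_proj = nn.Linear(self.h_dim, self.h_dim)
        self.l_q = nn.Linear(dim, self.l_dim)
        self.l_kv = nn.Linear(dim, self.l_dim * 2)
        self.l_proj = nn.Linear(self.l_dim, self.l_dim)

    def hifi(self, x):
        # High-frequency branch: standard attention restricted to ws x ws windows.
        B, H, W, C = x.shape
        hg, wg = H // self.ws, W // self.ws
        x = x.reshape(B, hg, self.ws, wg, self.ws, C).transpose(2, 3)
        x = x.reshape(B, hg * wg, self.ws * self.ws, C)
        qkv = self.h_qkv(x).reshape(B, hg * wg, self.ws * self.ws, 3,
                                    self.h_heads, -1).permute(3, 0, 1, 4, 2, 5)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(2, 3)
        out = out.reshape(B, hg, wg, self.ws, self.ws, self.h_dim)
        out = out.transpose(2, 3).reshape(B, H, W, self.h_dim)
        return self.h_proj(out)

    def lofi(self, x):
        # Low-frequency branch: full-resolution queries attend to per-window
        # average-pooled keys/values, shrinking the key/value sequence by ws^2.
        B, H, W, C = x.shape
        q = self.l_q(x).reshape(B, H * W, self.l_heads, -1).transpose(1, 2)
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.ws)   # B, C, H/ws, W/ws
        pooled = pooled.flatten(2).transpose(1, 2)              # B, (H*W)/ws^2, C
        kv = self.l_kv(pooled).reshape(B, -1, 2, self.l_heads,
                                       self.l_dim // self.l_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, H, W, self.l_dim)
        return self.l_proj(out)

    def forward(self, x):
        # x: B x H x W x C feature map; the two branch outputs are concatenated.
        return torch.cat([self.hifi(x), self.lofi(x)], dim=-1)


if __name__ == "__main__":
    x = torch.randn(1, 14, 14, 96)                     # H, W divisible by window_size
    print(HiLoSketch(dim=96, num_heads=8)(x).shape)    # torch.Size([1, 14, 14, 96])
```

Because the Lo-Fi keys and values are pooled once per window, that branch's attention cost shrinks by roughly a factor of ws^2 relative to full global attention, while the Hi-Fi branch scales linearly with the number of tokens for a fixed window size, which is consistent with the efficiency claims above.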
Related papers
- The Linear Attention Resurrection in Vision Transformer [0.6798775532273751]
Vision Transformers (ViTs) have recently taken computer vision by storm.
Softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images.
We propose a linear attention method that addresses this limitation without sacrificing ViT's core advantage of capturing global representations.
arXiv Detail & Related papers (2025-01-27T16:29:17Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - FasterViT: Fast Vision Transformers with Hierarchical Attention [63.50580266223651]
We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications.
Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs.
arXiv Detail & Related papers (2023-06-09T18:41:37Z) - Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference [33.69340426607746]
Vision Transformers (ViTs) have shown impressive performance but still incur a higher computational cost than convolutional neural networks (CNNs).
Existing efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g., Performer).
We propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention.
arXiv Detail & Related papers (2022-11-18T22:49:04Z) - Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via dynamic programming to minimize the overall cost of attention on the grouped patches.
arXiv Detail & Related papers (2022-05-26T17:34:42Z) - Inception Transformer [151.939077819196]
Inception Transformer, or iFormer, learns comprehensive features with both high- and low-frequency information in visual data.
We benchmark the iFormer on a series of vision tasks, and showcase that it achieves impressive performance on image classification, COCO detection and ADE20K segmentation.
arXiv Detail & Related papers (2022-05-25T17:59:54Z) - Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z) - Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z)