ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
- URL: http://arxiv.org/abs/2405.18425v2
- Date: Wed, 29 May 2024 02:06:30 GMT
- Title: ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
- Authors: Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, Chang Huang,
- Abstract summary: We introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency.
Our proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks.
At $1024\times1024$ resolution, ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T.
- Score: 33.00435765051738
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, linear complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in terms of actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling and a 2D gating locality injection to adaptively inject 2D local details into 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2$\times$ faster on $224\times224$ images. At $1024\times1024$ resolution, ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. These results position ViG as an efficient and scalable solution for visual representation learning. Code is available at \url{https://github.com/hustvl/ViG}.
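For intuition, the core recurrence behind gated linear attention is $S_t = g_t \odot S_{t-1} + k_t^{\top} v_t$, $o_t = q_t S_t$, which is linear in sequence length. Below is a naive reference sketch of a bidirectional GLA scan over patch tokens; the tensor shapes, the sum-combination of the two directions, and the omission of the 2D gating locality injection and of the fused hardware-aware kernel are all illustrative assumptions, not the released ViG code.

```python
# Naive reference sketch of bidirectional gated linear attention (GLA).
# Recurrence per direction: S_t = g_t (*) S_{t-1} + k_t^T v_t,  o_t = q_t S_t.
import torch


def gla_scan(q, k, v, g):
    """One-directional gated linear attention.
    q, k, g: (B, L, Dk) with g in (0, 1); v: (B, L, Dv). Cost is linear in L."""
    B, L, Dk = q.shape
    Dv = v.shape[-1]
    S = q.new_zeros(B, Dk, Dv)            # outer-product state ("memory")
    outs = []
    for t in range(L):
        # Decay the state per key dimension, then write the new key-value pair.
        S = g[:, t].unsqueeze(-1) * S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
        outs.append(torch.einsum("bd,bdv->bv", q[:, t], S))
    return torch.stack(outs, dim=1)       # (B, L, Dv)


def bidirectional_gla(q, k, v, g):
    """Direction-wise modeling: forward scan + backward scan over the patch tokens."""
    fwd = gla_scan(q, k, v, g)
    bwd = gla_scan(q.flip(1), k.flip(1), v.flip(1), g.flip(1)).flip(1)
    return fwd + bwd


if __name__ == "__main__":
    B, L, Dk, Dv = 2, 196, 64, 64                   # 14x14 patches of a 224x224 image
    q, k, v = (torch.randn(B, L, d) for d in (Dk, Dk, Dv))
    g = torch.sigmoid(torch.randn(B, L, Dk))        # data-dependent gates in (0, 1)
    print(bidirectional_gla(q, k, v, g).shape)      # torch.Size([2, 196, 64])
```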
Related papers
- The Linear Attention Resurrection in Vision Transformer [0.6798775532273751]
Vision Transformers (ViTs) have recently taken computer vision by storm.
Softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images.
We propose a linear attention method that addresses this limitation without sacrificing ViT's core advantage of capturing global representations.
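For reference, the generic kernelized linear-attention trick replaces softmax$(QK^{\top})V$ with $\phi(Q)(\phi(K)^{\top}V)$, so the $L\times L$ attention matrix is never materialized. The sketch below uses $\phi = \mathrm{elu} + 1$ as an assumed feature map; it illustrates the complexity argument, not necessarily this paper's specific method.

```python
# Sketch contrasting quadratic softmax attention with kernelized linear attention.
import torch
import torch.nn.functional as F


def softmax_attention(q, k, v):
    # O(L^2): materializes the full L x L attention matrix.
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v


def linear_attention(q, k, v, eps=1e-6):
    # O(L): compute phi(K)^T V first, a D x Dv matrix independent of L.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                         # (B, D, Dv)
    z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps)  # (B, L, 1)
    return (q @ kv) * z


if __name__ == "__main__":
    q = k = v = torch.randn(1, 4096, 64)        # 4096 tokens, e.g. a 64x64 feature grid
    print(linear_attention(q, k, v).shape)      # torch.Size([1, 4096, 64])
```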
arXiv Detail & Related papers (2025-01-27T16:29:17Z) - 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification [40.10133518650528]
Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism.
We propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba.
Experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves AUC by up to 2.48%, F1 score by 3.11%, accuracy by 2.47%, and C-index by 5.52%.
arXiv Detail & Related papers (2024-12-01T05:42:58Z) - DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
Diffusion Gated Linear Attention Transformers (DiG) is a simple, adoptable solution with minimal parameter overhead.
We offer two variants, i.e., a plain and a U-shaped architecture, showing superior efficiency and competitive effectiveness.
arXiv Detail & Related papers (2024-05-28T17:59:33Z) - Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [48.233300343211205]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim).
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models.
The results demonstrate that Vim overcomes the computation and memory constraints of performing Transformer-style understanding on high-resolution images.
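A deliberately simplified picture of the bidirectional idea: run a diagonal linear state-space recurrence over the patch sequence forward and backward, then combine the two passes. The per-channel parameters, the sum combination, and the omission of Mamba's input-dependent (selective) parameterization are all simplifying assumptions, not the Vim implementation.

```python
# Simplified bidirectional state-space scan: h_t = a (*) h_{t-1} + b (*) x_t, y_t = c (*) h_t.
import torch


def ssm_scan(x, a, b, c):
    """x: (B, L, D); a, b, c: (D,) diagonal SSM parameters. Returns (B, L, D)."""
    B, L, D = x.shape
    h = x.new_zeros(B, D)
    ys = []
    for t in range(L):
        h = a * h + b * x[:, t]
        ys.append(c * h)
    return torch.stack(ys, dim=1)


def bidirectional_ssm(x, a, b, c):
    # Forward scan plus reversed backward scan, summed.
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x.flip(1), a, b, c).flip(1)
    return fwd + bwd


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 192)                  # 14x14 patches, embedding dim 192
    a = torch.sigmoid(torch.randn(192))                # per-channel decay in (0, 1)
    b, c = torch.randn(192), torch.randn(192)
    print(bidirectional_ssm(tokens, a, b, c).shape)    # torch.Size([2, 196, 192])
```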
arXiv Detail & Related papers (2024-01-17T18:56:18Z) - Semantic Segmentation in Satellite Hyperspectral Imagery by Deep Learning [54.094272065609815]
We propose a lightweight 1D-CNN model, 1D-Justo-LiuNet, which outperforms state-of-the-art models in the hyperspectral domain.
1D-Justo-LiuNet achieves the highest accuracy (0.93) with the smallest model size (4,563 parameters) among all tested models.
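To make the scale concrete, a per-pixel spectral classifier of this kind amounts to a few thousand parameters of 1D convolutions over the band axis followed by a small classifier head. The layer sizes below are illustrative assumptions, not the published 1D-Justo-LiuNet configuration.

```python
# Minimal per-pixel 1D-CNN over the spectral axis (sizes are illustrative).
import torch
import torch.nn as nn


class Tiny1DCNN(nn.Module):
    def __init__(self, bands: int = 112, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=6), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(8, 16, kernel_size=6), nn.ReLU(), nn.MaxPool1d(2),
        )
        with torch.no_grad():                       # infer flattened feature size
            flat = self.features(torch.zeros(1, 1, bands)).numel()
        self.head = nn.Linear(flat, num_classes)

    def forward(self, x):                           # x: (B, bands) spectrum per pixel
        z = self.features(x.unsqueeze(1))           # add channel dim -> (B, 1, bands)
        return self.head(z.flatten(1))


if __name__ == "__main__":
    model = Tiny1DCNN()
    print(sum(p.numel() for p in model.parameters()), "parameters")
    print(model(torch.randn(4, 112)).shape)         # torch.Size([4, 3])
```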
arXiv Detail & Related papers (2023-10-24T21:57:59Z) - SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer [42.440822037774645]
We introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs).
SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation.
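The underlying idea of window-level activation sparsity can be sketched as scoring windows by activation magnitude and running the expensive window attention only on the kept ones. The scoring rule and keep ratio below are assumptions for illustration, not SparseViT's exact policy.

```python
# Toy window pruning: keep the top fraction of windows by L2 activation norm.
import torch


def prune_windows(x, keep_ratio=0.5):
    """x: (B, num_windows, tokens_per_window, C). Returns kept windows and indices."""
    scores = x.flatten(2).norm(dim=-1)                     # (B, num_windows)
    k = max(1, int(keep_ratio * x.shape[1]))
    idx = scores.topk(k, dim=1).indices                    # indices of kept windows
    kept = torch.gather(
        x, 1, idx[..., None, None].expand(-1, -1, x.shape[2], x.shape[3])
    )
    return kept, idx


if __name__ == "__main__":
    feats = torch.randn(2, 64, 49, 96)                     # 64 windows of 7x7 tokens
    kept, idx = prune_windows(feats, keep_ratio=0.5)
    print(kept.shape)                                      # torch.Size([2, 32, 49, 96])
```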
arXiv Detail & Related papers (2023-03-30T17:59:58Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
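The long-range half of the mechanism can be pictured as a data-dependent low-rank compression of keys and values to r landmark tokens, so attention costs O(L*r) instead of O(L^2). The sketch below shows only that piece (the short-term sliding-window branch is omitted), and its names and shapes are assumptions rather than the Transformer-LS implementation.

```python
# Sketch of dynamic low-rank projection for long-range attention.
import torch
import torch.nn as nn


class DynamicProjectionAttention(nn.Module):
    def __init__(self, dim: int, r: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, r)            # per-token landmark mixing scores

    def forward(self, q, k, v):
        P = torch.softmax(self.proj(k), dim=1)   # (B, L, r), data-dependent weights
        k_bar = P.transpose(1, 2) @ k            # (B, r, D) compressed keys
        v_bar = P.transpose(1, 2) @ v            # (B, r, D) compressed values
        attn = torch.softmax(q @ k_bar.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v_bar                      # (B, L, D), linear in L for fixed r


if __name__ == "__main__":
    x = torch.randn(2, 4096, 128)
    m = DynamicProjectionAttention(128)
    print(m(x, x, x).shape)                      # torch.Size([2, 4096, 128])
```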
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - Early Convolutions Help Transformers See Better [63.21712652156238]
Vision transformer (ViT) models exhibit substandard optimizability.
Modern convolutional neural networks are far easier to optimize.
Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance.
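Concretely, the change amounts to swapping ViT's single stride-16 patchify convolution for a small stack of stride-2 3x3 convolutions that ends at the same 14x14 token grid. The channel widths below are illustrative assumptions, not the paper's exact stem.

```python
# Sketch of a convolutional stem producing a 14x14 token grid from a 224x224 image.
import torch
import torch.nn as nn


def conv_stem(embed_dim: int = 384) -> nn.Sequential:
    """Four 3x3 stride-2 convs (224 -> 14) plus a 1x1 projection to embed_dim."""
    chans = [3, 48, 96, 192, 384]
    layers = []
    for cin, cout in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(chans[-1], embed_dim, 1))
    return nn.Sequential(*layers)


if __name__ == "__main__":
    stem = conv_stem()
    tokens = stem(torch.randn(1, 3, 224, 224))        # (1, 384, 14, 14)
    print(tokens.flatten(2).transpose(1, 2).shape)    # 196 tokens of dim 384
```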
arXiv Detail & Related papers (2021-06-28T17:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.