ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
- URL: http://arxiv.org/abs/2405.18425v2
- Date: Wed, 29 May 2024 02:06:30 GMT
- Title: ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
- Authors: Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, Chang Huang,
- Abstract summary: We introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency.
Our proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks.
At $1024\times1024$ resolution, ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T.
- Score: 33.00435765051738
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, linear complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in terms of actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling and a 2D gating locality injection to adaptively inject 2D local details into 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2$\times$ faster on $224\times224$ images. At $1024\times1024$ resolution, ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. These results position ViG as an efficient and scalable solution for visual representation learning. Code is available at \url{https://github.com/hustvl/ViG}.
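For intuition, the core recurrence behind gated linear attention is $S_t = g_t \odot S_{t-1} + k_t^{\top} v_t$, $o_t = q_t S_t$, which is linear in sequence length. Below is a naive reference sketch of a bidirectional GLA scan over patch tokens; the tensor shapes, the sum-combination of the two directions, and the omission of the 2D gating locality injection and of the fused hardware-aware kernel are all illustrative assumptions, not the released ViG code.

```python
# Naive reference sketch of bidirectional gated linear attention (GLA).
# Recurrence per direction: S_t = g_t (*) S_{t-1} + k_t^T v_t,  o_t = q_t S_t.
import torch


def gla_scan(q, k, v, g):
    """One-directional gated linear attention.
    q, k, g: (B, L, Dk) with g in (0, 1); v: (B, L, Dv). Cost is linear in L."""
    B, L, Dk = q.shape
    Dv = v.shape[-1]
    S = q.new_zeros(B, Dk, Dv)            # outer-product state ("memory")
    outs = []
    for t in range(L):
        # Decay the state per key dimension, then write the new key-value pair.
        S = g[:, t].unsqueeze(-1) * S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
        outs.append(torch.einsum("bd,bdv->bv", q[:, t], S))
    return torch.stack(outs, dim=1)       # (B, L, Dv)


def bidirectional_gla(q, k, v, g):
    """Direction-wise modeling: forward scan + backward scan over the patch tokens."""
    fwd = gla_scan(q, k, v, g)
    bwd = gla_scan(q.flip(1), k.flip(1), v.flip(1), g.flip(1)).flip(1)
    return fwd + bwd


if __name__ == "__main__":
    B, L, Dk, Dv = 2, 196, 64, 64                   # 14x14 patches of a 224x224 image
    q, k, v = (torch.randn(B, L, d) for d in (Dk, Dk, Dv))
    g = torch.sigmoid(torch.randn(B, L, Dk))        # data-dependent gates in (0, 1)
    print(bidirectional_gla(q, k, v, g).shape)      # torch.Size([2, 196, 64])
```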
Related papers
- The Linear Attention Resurrection in Vision Transformer [0.6798775532273751]
Vision Transformers (ViTs) have recently taken computer vision by storm.
Softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high-resolution images.
We propose a linear attention method that addresses this limitation without sacrificing ViT's core advantage of capturing global representations.
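For reference, the generic kernelized linear-attention trick replaces softmax$(QK^{\top})V$ with $\phi(Q)(\phi(K)^{\top}V)$, so the $L\times L$ attention matrix is never materialized. The sketch below uses $\phi = \mathrm{elu} + 1$ as an assumed feature map; it illustrates the complexity argument, not necessarily this paper's specific method.

```python
# Sketch contrasting quadratic softmax attention with kernelized linear attention.
import torch
import torch.nn.functional as F


def softmax_attention(q, k, v):
    # O(L^2): materializes the full L x L attention matrix.
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v


def linear_attention(q, k, v, eps=1e-6):
    # O(L): compute phi(K)^T V first, a D x Dv matrix independent of L.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                                         # (B, D, Dv)
    z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps)  # (B, L, 1)
    return (q @ kv) * z


if __name__ == "__main__":
    q = k = v = torch.randn(1, 4096, 64)        # 4096 tokens, e.g. a 64x64 feature grid
    print(linear_attention(q, k, v).shape)      # torch.Size([1, 4096, 64])
```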
arXiv Detail & Related papers (2025-01-27T16:29:17Z) - 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification [40.10133518650528]
Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism.
We propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba.
Experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves AUC by up to 2.48%, F1 score by 3.11%, accuracy by 2.47%, and C-index by 5.52%.
arXiv Detail & Related papers (2024-12-01T05:42:58Z) - DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
Diffusion Gated Linear Attention Transformers (DiG) is a simple, adoptable solution with minimal parameter overhead.
We offer two variants, i.e., a plain and a U-shaped architecture, showing superior efficiency and competitive effectiveness.
arXiv Detail & Related papers (2024-05-28T17:59:33Z) - Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [48.233300343211205]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim).
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models.
The results demonstrate that Vim overcomes the computation and memory constraints of performing Transformer-style understanding on high-resolution images.
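A deliberately simplified picture of the bidirectional idea: run a diagonal linear state-space recurrence over the patch sequence forward and backward, then combine the two passes. The per-channel parameters, the sum combination, and the omission of Mamba's input-dependent (selective) parameterization are all simplifying assumptions, not the Vim implementation.

```python
# Simplified bidirectional state-space scan: h_t = a (*) h_{t-1} + b (*) x_t, y_t = c (*) h_t.
import torch


def ssm_scan(x, a, b, c):
    """x: (B, L, D); a, b, c: (D,) diagonal SSM parameters. Returns (B, L, D)."""
    B, L, D = x.shape
    h = x.new_zeros(B, D)
    ys = []
    for t in range(L):
        h = a * h + b * x[:, t]
        ys.append(c * h)
    return torch.stack(ys, dim=1)


def bidirectional_ssm(x, a, b, c):
    # Forward scan plus reversed backward scan, summed.
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x.flip(1), a, b, c).flip(1)
    return fwd + bwd


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 192)                  # 14x14 patches, embedding dim 192
    a = torch.sigmoid(torch.randn(192))                # per-channel decay in (0, 1)
    b, c = torch.randn(192), torch.randn(192)
    print(bidirectional_ssm(tokens, a, b, c).shape)    # torch.Size([2, 196, 192])
```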
arXiv Detail & Related papers (2024-01-17T18:56:18Z) - Semantic Segmentation in Satellite Hyperspectral Imagery by Deep Learning [54.094272065609815]
We propose a lightweight 1D-CNN model, 1D-Justo-LiuNet, which outperforms state-of-the-art models in the hyperspectral domain.
1D-Justo-LiuNet achieves the highest accuracy (0.93) with the smallest model size (4,563 parameters) among all tested models.
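To make the scale concrete, a per-pixel spectral classifier of this kind amounts to a few thousand parameters of 1D convolutions over the band axis followed by a small classifier head. The layer sizes below are illustrative assumptions, not the published 1D-Justo-LiuNet configuration.

```python
# Minimal per-pixel 1D-CNN over the spectral axis (sizes are illustrative).
import torch
import torch.nn as nn


class Tiny1DCNN(nn.Module):
    def __init__(self, bands: int = 112, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=6), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(8, 16, kernel_size=6), nn.ReLU(), nn.MaxPool1d(2),
        )
        with torch.no_grad():                       # infer flattened feature size
            flat = self.features(torch.zeros(1, 1, bands)).numel()
        self.head = nn.Linear(flat, num_classes)

    def forward(self, x):                           # x: (B, bands) spectrum per pixel
        z = self.features(x.unsqueeze(1))           # add channel dim -> (B, 1, bands)
        return self.head(z.flatten(1))


if __name__ == "__main__":
    model = Tiny1DCNN()
    print(sum(p.numel() for p in model.parameters()), "parameters")
    print(model(torch.randn(4, 112)).shape)         # torch.Size([4, 3])
```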
arXiv Detail & Related papers (2023-10-24T21:57:59Z) - SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer [42.440822037774645]
We introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs).
SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation.
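The underlying idea of window-level activation sparsity can be sketched as scoring windows by activation magnitude and running the expensive window attention only on the kept ones. The scoring rule and keep ratio below are assumptions for illustration, not SparseViT's exact policy.

```python
# Toy window pruning: keep the top fraction of windows by L2 activation norm.
import torch


def prune_windows(x, keep_ratio=0.5):
    """x: (B, num_windows, tokens_per_window, C). Returns kept windows and indices."""
    scores = x.flatten(2).norm(dim=-1)                     # (B, num_windows)
    k = max(1, int(keep_ratio * x.shape[1]))
    idx = scores.topk(k, dim=1).indices                    # indices of kept windows
    kept = torch.gather(
        x, 1, idx[..., None, None].expand(-1, -1, x.shape[2], x.shape[3])
    )
    return kept, idx


if __name__ == "__main__":
    feats = torch.randn(2, 64, 49, 96)                     # 64 windows of 7x7 tokens
    kept, idx = prune_windows(feats, keep_ratio=0.5)
    print(kept.shape)                                      # torch.Size([2, 32, 49, 96])
```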
arXiv Detail & Related papers (2023-03-30T17:59:58Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
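The long-range half of the mechanism can be pictured as a data-dependent low-rank compression of keys and values to r landmark tokens, so attention costs O(L*r) instead of O(L^2). The sketch below shows only that piece (the short-term sliding-window branch is omitted), and its names and shapes are assumptions rather than the Transformer-LS implementation.

```python
# Sketch of dynamic low-rank projection for long-range attention.
import torch
import torch.nn as nn


class DynamicProjectionAttention(nn.Module):
    def __init__(self, dim: int, r: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, r)            # per-token landmark mixing scores

    def forward(self, q, k, v):
        P = torch.softmax(self.proj(k), dim=1)   # (B, L, r), data-dependent weights
        k_bar = P.transpose(1, 2) @ k            # (B, r, D) compressed keys
        v_bar = P.transpose(1, 2) @ v            # (B, r, D) compressed values
        attn = torch.softmax(q @ k_bar.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v_bar                      # (B, L, D), linear in L for fixed r


if __name__ == "__main__":
    x = torch.randn(2, 4096, 128)
    m = DynamicProjectionAttention(128)
    print(m(x, x, x).shape)                      # torch.Size([2, 4096, 128])
```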
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - Early Convolutions Help Transformers See Better [63.21712652156238]
Vision transformer (ViT) models exhibit substandard optimizability.
Modern convolutional neural networks are far easier to optimize.
Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance.
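Concretely, the change amounts to swapping ViT's single stride-16 patchify convolution for a small stack of stride-2 3x3 convolutions that ends at the same 14x14 token grid. The channel widths below are illustrative assumptions, not the paper's exact stem.

```python
# Sketch of a convolutional stem producing a 14x14 token grid from a 224x224 image.
import torch
import torch.nn as nn


def conv_stem(embed_dim: int = 384) -> nn.Sequential:
    """Four 3x3 stride-2 convs (224 -> 14) plus a 1x1 projection to embed_dim."""
    chans = [3, 48, 96, 192, 384]
    layers = []
    for cin, cout in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(chans[-1], embed_dim, 1))
    return nn.Sequential(*layers)


if __name__ == "__main__":
    stem = conv_stem()
    tokens = stem(torch.randn(1, 3, 224, 224))        # (1, 384, 14, 14)
    print(tokens.flatten(2).transpose(1, 2).shape)    # 196 tokens of dim 384
```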
arXiv Detail & Related papers (2021-06-28T17:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.