RMT: Retentive Networks Meet Vision Transformers
- URL: http://arxiv.org/abs/2309.11523v5
- Date: Sat, 2 Dec 2023 06:23:09 GMT
- Title: RMT: Retentive Networks Meet Vision Transformers
- Authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu and Ran He
- Abstract summary: Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years.
Self-Attention lacks explicit spatial priors and bears a quadratic computational complexity.
We propose RMT, a strong vision backbone with explicit spatial prior for general purposes.
- Score: 59.827563438653975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has gained increasing attention in the computer
vision community in recent years. However, the core component of ViT,
Self-Attention, lacks explicit spatial priors and bears a quadratic
computational complexity, thereby constraining the applicability of ViT. To
alleviate these issues, we draw inspiration from the recent Retentive Network
(RetNet) in the field of NLP, and propose RMT, a strong vision backbone with
explicit spatial prior for general purposes. Specifically, we extend the
RetNet's temporal decay mechanism to the spatial domain, and propose a spatial
decay matrix based on the Manhattan distance to introduce the explicit spatial
prior to Self-Attention. Additionally, an attention decomposition form that
adeptly adapts to explicit spatial prior is proposed, aiming to reduce the
computational burden of modeling global information without disrupting the
spatial decay matrix. Based on the spatial decay matrix and the attention
decomposition form, we can flexibly integrate explicit spatial prior into the
vision backbone with linear complexity. Extensive experiments demonstrate that
RMT exhibits exceptional performance across various vision tasks. Specifically,
without extra training data, RMT achieves **84.8%** and **86.1%** top-1 acc on
ImageNet-1k with **27M/4.5GFLOPs** and **96M/18.2GFLOPs**. For downstream
tasks, RMT achieves **54.5** box AP and **47.2** mask AP on the COCO detection
task, and **52.8** mIoU on the ADE20K semantic segmentation task. Code is
available at https://github.com/qhfan/RMT
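As a concrete illustration of the abstract's core idea, below is a minimal sketch of how a Manhattan-distance spatial decay matrix can modulate softmax self-attention. This is not the authors' implementation (see the repository above for that); the decay constant `gamma`, the tensor shapes, and the helper names are assumptions made for illustration only.

```python
# Sketch: self-attention with an explicit spatial prior given by a
# Manhattan-distance decay matrix, as described in the abstract.
import torch


def manhattan_decay_mask(h: int, w: int, gamma: float = 0.9) -> torch.Tensor:
    """Spatial decay matrix D[n, m] = gamma ** (|x_n - x_m| + |y_n - y_m|)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2)
    dist = torch.cdist(coords, coords, p=1)  # pairwise Manhattan distances, (h*w, h*w)
    return gamma ** dist


def spatial_decay_attention(q, k, v, h: int, w: int, gamma: float = 0.9):
    """Softmax attention modulated by the spatial decay matrix (non-decomposed form)."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)   # (B, h*w, h*w)
    attn = attn * manhattan_decay_mask(h, w, gamma).to(q.device)    # explicit spatial prior
    return attn @ v


# Usage: a 14x14 feature map with 64-dimensional tokens.
q = k = v = torch.randn(1, 14 * 14, 64)
out = spatial_decay_attention(q, k, v, h=14, w=14)
print(out.shape)  # torch.Size([1, 196, 64])
```

The non-decomposed form above still costs O((HW)^2). The attention decomposition mentioned in the abstract avoids this by modeling global information at a lower cost while keeping the spatial decay matrix intact, which is what lets RMT integrate the explicit spatial prior with linear complexity.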
Related papers
- Vision Transformer with Sparse Scan Prior [57.37893387775829]
Inspired by the human eye's sparse scanning mechanism, we propose a Sparse Scan Self-Attention ($\rm S^3A$) mechanism.
This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors.
Building on $\rm S^3A$, we introduce the Sparse Scan Vision Transformer (SSViT).
arXiv Detail & Related papers (2024-05-22T04:34:36Z)
- SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction [15.331332063879342]
We propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing.
SparseOcc achieves a remarkable 74.9% reduction on FLOPs over the dense baseline.
It also improves accuracy, from 12.8% to 14.1% mIoU, which in part can be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
arXiv Detail & Related papers (2024-04-15T06:45:06Z)
- ACC-ViT: Atrous Convolution's Comeback in Vision Transformers [5.224344210588584]
We introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information.
We also propose a general vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks.
ACC-ViT is therefore a strong vision backbone that is also competitive in mobile-scale versions, making it well suited to niche applications with small datasets.
arXiv Detail & Related papers (2024-03-07T04:05:16Z)
- Learning Spatial-Temporal Regularized Tensor Sparse RPCA for Background Subtraction [6.825970634402847]
We present a spatial-temporal regularized tensor sparse RPCA algorithm for precise background subtraction.
Experiments are performed on six publicly available background subtraction datasets.
arXiv Detail & Related papers (2023-09-27T11:21:31Z)
- RFAConv: Innovating Spatial Attention and Standard Convolutional Operation [7.2646541547165056]
We propose a novel attention mechanism called Receptive-Field Attention (RFA).
RFA not only focuses on the receptive-field spatial feature but also provides effective attention weights for large-size convolutional kernels.
It offers nearly negligible increment of computational cost and parameters, while significantly improving network performance.
arXiv Detail & Related papers (2023-04-06T16:21:56Z)
- Global Context Vision Transformers [78.5346173956383]
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z)
- UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolution neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3% top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves strong performance in person re-identification (ReID).
However, due to the domain gap between ImageNet and ReID datasets, it usually requires a larger pre-training dataset to boost performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Towards Accurate Pixel-wise Object Tracking by Attention Retrieval [50.06436600343181]
We propose an attention retrieval network (ARN) to perform soft spatial constraints on backbone features.
We set a new state of the art on the recent pixel-wise object tracking benchmark VOT2020 while running at 40 fps.
arXiv Detail & Related papers (2020-08-06T16:25:23Z)