ViR: Towards Efficient Vision Retention Backbones
- URL: http://arxiv.org/abs/2310.19731v2
- Date: Fri, 26 Jan 2024 18:57:35 GMT
- Title: ViR: Towards Efficient Vision Retention Backbones
- Authors: Ali Hatamizadeh, Michael Ranzinger, Shiyi Lan, Jose M. Alvarez, Sanja Fidler, Jan Kautz
- Abstract summary: We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
- Score: 97.93707844681893
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision Transformers (ViTs) have attracted a lot of popularity in recent
years, due to their exceptional capabilities in modeling long-range spatial
dependencies and scalability for large scale training. Although the training
parallelism of the self-attention mechanism plays an important role in retaining
great performance, its quadratic complexity hinders the application of ViTs in
many scenarios which demand fast inference. This effect is even more pronounced
in applications in which autoregressive modeling of input features is required.
In Natural Language Processing (NLP), a new stream of efforts has proposed
parallelizable models with recurrent formulation that allows for efficient
inference in generative applications. Inspired by this trend, we propose a new
class of computer vision models, dubbed Vision Retention Networks (ViR), with
dual parallel and recurrent formulations, which strike an optimal balance
between fast inference and parallel training with competitive performance. In
particular, ViR scales favorably for image throughput and memory consumption in
tasks that require higher-resolution images due to its flexible formulation in
processing large sequence lengths. The ViR is the first attempt to realize dual
parallel and recurrent equivalency in a general vision backbone for recognition
tasks. We have validated the effectiveness of ViR through extensive experiments
with different dataset sizes and various image resolutions and achieved
competitive performance. Code: https://github.com/NVlabs/ViR
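The abstract does not spell out the retention operator, so the following is only a minimal sketch of what "dual parallel and recurrent formulations" can mean. It assumes the single-head retention used in the NLP work that inspired ViR: a causal product of queries and keys weighted by a scalar decay `gamma`, with no normalization, gating, or multi-head logic. The function names, the scalar decay, and the list-of-lists layout are illustrative assumptions, not the API of the NVlabs/ViR repository.

```python
import random

# Hypothetical sketch: single-head retention with an assumed scalar decay gamma.


def retention_parallel(Q, K, V, gamma):
    """Parallel form: every output comes from one masked (Q K^T) V product."""
    n, d_v = len(Q), len(V[0])
    out = []
    for i in range(n):
        row = [0.0] * d_v
        for j in range(i + 1):  # causal mask: token i only sees tokens j <= i
            score = sum(q * k for q, k in zip(Q[i], K[j])) * gamma ** (i - j)
            for c in range(d_v):
                row[c] += score * V[j][c]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: one fixed-size state matrix is updated per token."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # S_0 = 0
    out = []
    for q, k, v in zip(Q, K, V):
        # S_n = gamma * S_{n-1} + k_n^T v_n
        state = [[gamma * state[a][c] + k[a] * v[c] for c in range(d_v)]
                 for a in range(d_k)]
        # o_n = q_n S_n
        out.append([sum(q[a] * state[a][c] for a in range(d_k))
                    for c in range(d_v)])
    return out


def max_abs_diff(A, B):
    """Largest element-wise gap between two equally shaped matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Compare the two forms on random inputs (8 tokens, 4 channels).
rng = random.Random(0)
Q, K, V = (random_matrix(rng, 8, 4) for _ in range(3))
gap = max_abs_diff(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
print("max difference between the two forms:", gap)
```

Under these assumptions the two functions agree up to floating-point rounding. The parallel form touches every (i, j) token pair, which suits batched training, while the recurrent form carries only a `d_k x d_v` state from token to token, which is what keeps per-token inference cost independent of sequence length.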
Related papers
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z)
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- iiANET: Inception Inspired Attention Hybrid Network for efficient Long-Range Dependency [0.0]
We introduce iiANET (Inception Inspired Attention Network), an efficient hybrid model designed to capture long-range dependencies in complex images.
The fundamental building block, iiABlock, integrates global 2D-MHSA (Multi-Head Self-Attention) with Registers, MBConv2 (MobileNetV2-based convolution), and dilated convolution in parallel.
We serially integrate an ECANET (Efficient Channel Attention Network) at the end of each iiABlock to calibrate channel-wise attention for enhanced model performance.
arXiv Detail & Related papers (2024-07-10T12:39:02Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- CViT: Continuous Vision Transformer for Operator Learning [24.1795082775376]
Continuous Vision Transformer (CViT) is a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems.
CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies.
We demonstrate CViT's effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes.
arXiv Detail & Related papers (2024-05-22T21:13:23Z)
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [99.20299078655376]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Sequencer: Deep LSTM for Image Classification [0.0]
In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts.
We propose Sequencer, a novel and competitive architecture alternative to ViT.
Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well.
arXiv Detail & Related papers (2022-05-04T09:47:46Z)
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)