Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like
Architectures
- URL: http://arxiv.org/abs/2403.02308v2
- Date: Thu, 7 Mar 2024 15:43:08 GMT
- Title: Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like
Architectures
- Authors: Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu
Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang
- Abstract summary: This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
- Score: 99.20299078655376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have revolutionized computer vision and natural language
processing, but their high computational complexity limits their application in
high-resolution image processing and long-context analysis. This paper
introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the
NLP field with necessary modifications for vision tasks. Similar to the Vision
Transformer (ViT), our model is designed to efficiently handle sparse inputs
and demonstrate robust global processing capabilities, while also scaling up
effectively, accommodating both large-scale parameters and extensive datasets.
Its distinctive advantage lies in its reduced spatial aggregation complexity,
which renders it exceptionally adept at processing high-resolution images
seamlessly, eliminating the necessity for windowing operations. Our evaluations
demonstrate that VRWKV surpasses ViT's performance in image classification and
has significantly faster speeds and lower memory usage when processing
high-resolution inputs. In dense prediction tasks, it outperforms window-based
models, maintaining comparable speeds. These results highlight VRWKV's
potential as a more efficient alternative for visual perception tasks. Code is
released at https://github.com/OpenGVLab/Vision-RWKV.
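To make the complexity claim concrete, the sketch below contrasts ViT-style global softmax attention, whose cost grows quadratically with the number of patch tokens, with a simplified RWKV-style recurrent mixer whose cost grows linearly. This is a minimal illustration under stated assumptions, not the repository's Bi-WKV implementation: the names rwkv_like_mix, decay, k_proj, and v_proj are hypothetical, and the actual VRWKV layer is bidirectional and includes shift and bonus terms omitted here.

```python
# Minimal sketch (assumed, simplified) of quadratic attention vs. an
# RWKV-style recurrent mixer that is linear in the number of patch tokens.
import torch

def softmax_attention(x):
    """ViT-style global attention: builds an N x N score matrix -> O(N^2)."""
    scores = (x @ x.T) / x.shape[-1] ** 0.5   # (N, N)
    return torch.softmax(scores, dim=-1) @ x  # (N, C)

def rwkv_like_mix(x, decay, k_proj, v_proj):
    """Simplified RWKV-style mixing: one recurrent scan over tokens -> O(N).

    Only the exponentially decayed running sums are kept here; the real
    VRWKV layer scans in both directions and adds shift/bonus terms.
    """
    n, c = x.shape
    k = torch.exp(x @ k_proj)                 # (N, C) positive keys
    v = x @ v_proj                            # (N, C) values
    num = torch.zeros(c)
    den = torch.zeros(c)
    out = torch.empty_like(x)
    for t in range(n):                        # single pass over the sequence
        num = torch.exp(-decay) * num + k[t] * v[t]
        den = torch.exp(-decay) * den + k[t]
        out[t] = num / (den + 1e-6)
    return out

# Toy usage: 196 tokens from a 14x14 patch grid, 64 channels.
x = torch.randn(196, 64)
decay = torch.zeros(64)                       # per-channel decay (illustrative init)
k_proj = 0.02 * torch.randn(64, 64)
v_proj = 0.02 * torch.randn(64, 64)
print(softmax_attention(x).shape, rwkv_like_mix(x, decay, k_proj, v_proj).shape)
```

Doubling the input resolution quadruples the token count N, so the N x N score matrix of the attention path grows roughly 16x while the recurrent scan only grows 4x, which is the behavior the abstract attributes to VRWKV's reduced spatial aggregation complexity.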
Related papers
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs).
Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity.
Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
- ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages [0.0]
Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies.
We propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers.
arXiv Detail & Related papers (2025-04-21T03:00:17Z)
- VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models [1.03590082373586]
We present VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs.
Both models support resolutions up to 4096 x 4096 pixels, offering a more detailed and comprehensive visual processing capability.
arXiv Detail & Related papers (2024-10-15T14:49:19Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models [33.372947082734946]
This paper introduces a series of architectures adapted from the RWKV model used in NLP, with requisite modifications tailored for diffusion models applied to image generation tasks.
Our model is designed to efficiently handle patchnified inputs in a sequence with extra conditions, while also scaling up effectively.
Its distinctive advantage manifests in its reduced spatial aggregation complexity, rendering it exceptionally adept at processing high-resolution images.
arXiv Detail & Related papers (2024-04-06T02:54:35Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by allowing the depth, width, resolution, and patch-size dimensions to be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)