Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
- URL: http://arxiv.org/abs/2403.02308v2
- Date: Thu, 7 Mar 2024 15:43:08 GMT
- Title: Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
- Authors: Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu
Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang
- Abstract summary: This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field.
Our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities.
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage.
- Score: 99.20299078655376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have revolutionized computer vision and natural language
processing, but their high computational complexity limits their application in
high-resolution image processing and long-context analysis. This paper
introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the
NLP field with necessary modifications for vision tasks. Similar to the Vision
Transformer (ViT), our model is designed to efficiently handle sparse inputs
and demonstrate robust global processing capabilities, while also scaling up
effectively, accommodating both large-scale parameters and extensive datasets.
Its distinctive advantage lies in its reduced spatial aggregation complexity,
which renders it exceptionally adept at processing high-resolution images
seamlessly, eliminating the necessity for windowing operations. Our evaluations
demonstrate that VRWKV surpasses ViT's performance in image classification and
has significantly faster speeds and lower memory usage when processing
high-resolution inputs. In dense prediction tasks, it outperforms window-based
models, maintaining comparable speeds. These results highlight VRWKV's
potential as a more efficient alternative for visual perception tasks. Code is
released at \url{https://github.com/OpenGVLab/Vision-RWKV}.
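The abstract's central efficiency claim is that spatial aggregation over patch tokens scales linearly with the number of tokens, rather than quadratically as in full self-attention. As a rough illustration only (this is not the released Vision-RWKV code; see the repository above for that), the sketch below computes a bidirectional, distance-decayed aggregation over a flattened patch sequence with two linear scans. The names k, v, w, u follow common RWKV notation, and all shapes and details are assumptions made for the example.

```python
# Hypothetical sketch of linear-complexity bidirectional "WKV"-style mixing
# over flattened image patch tokens. Illustrative only, not the VRWKV code.
import numpy as np

def bidirectional_wkv(k: np.ndarray, v: np.ndarray,
                      w: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Global token mixing in O(T) time.

    k, v : (T, C) per-token key/value projections of the patch tokens
    w    : (C,)  non-negative per-channel decay rate
    u    : (C,)  per-channel bonus weight for the current token
    Returns a (T, C) aggregated output.
    """
    T, C = k.shape
    decay = np.exp(-w)    # contribution shrinks with token distance
    ek = np.exp(k)        # sketch only: no max-subtraction for numerical stability

    # Forward scan: decay-weighted sums over tokens strictly before position t.
    num_f = np.zeros((T, C)); den_f = np.zeros((T, C))
    a = np.zeros(C); b = np.zeros(C)
    for t in range(T):
        num_f[t], den_f[t] = a, b
        a = decay * a + ek[t] * v[t]
        b = decay * b + ek[t]

    # Backward scan: decay-weighted sums over tokens strictly after position t.
    num_b = np.zeros((T, C)); den_b = np.zeros((T, C))
    a = np.zeros(C); b = np.zeros(C)
    for t in range(T - 1, -1, -1):
        num_b[t], den_b[t] = a, b
        a = decay * a + ek[t] * v[t]
        b = decay * b + ek[t]

    cur = np.exp(u + k)   # the current token receives its own bonus weight
    return (num_f + num_b + cur * v) / (den_f + den_b + cur + 1e-9)

# Toy usage: e.g. a 32x32 image with 4x4 patches -> 64 tokens, 16 channels.
rng = np.random.default_rng(0)
T, C = 64, 16
k, v = rng.normal(size=(T, C)) * 0.1, rng.normal(size=(T, C))
w, u = np.abs(rng.normal(size=C)) * 0.1, rng.normal(size=C) * 0.1
out = bidirectional_wkv(k, v, w, u)
print(out.shape)  # (64, 16); cost grows linearly with the number of tokens
```

Because each position only reuses running sums carried by the two scans, the cost grows linearly with the number of patches, which is the property the abstract credits for handling high-resolution inputs without windowing operations.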
Related papers
- VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models [1.03590082373586]
We present VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs.
Both models support resolutions up to 4096 x 4096 pixels, offering a more detailed and comprehensive visual processing capability.
arXiv Detail & Related papers (2024-10-15T14:49:19Z)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models [33.372947082734946]
This paper introduces a series of architectures adapted from the RWKV model used in NLP, with the requisite modifications for diffusion models applied to image generation tasks.
Our model is designed to efficiently handle patchified inputs in a sequence with extra conditions, while also scaling up effectively.
Its distinctive advantage manifests in its reduced spatial aggregation complexity, rendering it exceptionally adept at processing high-resolution images.
arXiv Detail & Related papers (2024-04-06T02:54:35Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: the dimensions of depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)