VMamba: Visual State Space Model
- URL: http://arxiv.org/abs/2401.10166v3
- Date: Sun, 26 May 2024 08:31:28 GMT
- Title: VMamba: Visual State Space Model
- Authors: Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
- Abstract summary: VMamba is a vision backbone that works in linear time complexity.
At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module.
- Score: 92.83984290020891
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Designing computationally efficient network architectures persists as an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba's promising performance across diverse visual perception tasks, highlighting its advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
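The following is a minimal, illustrative NumPy sketch of the four-route scanning idea described in the abstract, not the official implementation: an H x W feature map is unfolded along four scanning routes, a 1D sequence operator is applied per route, and the results are folded back and merged. The helper names (`four_route_indices`, `selective_scan_1d`, `ss2d_sketch`) are hypothetical, and the running-mean placeholder merely stands in for the actual selective-scan (S6) kernel and the learned projections of a real VSS block.

```python
# Illustrative sketch of the SS2D idea (cross-scan + cross-merge), assuming a
# simple placeholder in place of the real selective-scan kernel.
import numpy as np

def four_route_indices(H, W):
    """Index arrays that flatten an HxW grid along four routes:
    row-major, column-major, and their reverses."""
    grid = np.arange(H * W).reshape(H, W)
    row_major = grid.reshape(-1)           # left-to-right, top-to-bottom
    col_major = grid.T.reshape(-1)         # top-to-bottom, left-to-right
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def selective_scan_1d(seq):
    """Placeholder for the 1D selective scan (S6); a causal running mean
    keeps the example self-contained and runnable."""
    cumsum = np.cumsum(seq, axis=0)
    steps = np.arange(1, seq.shape[0] + 1)[:, None]
    return cumsum / steps

def ss2d_sketch(x):
    """x: (H, W, C) feature map -> (H, W, C) with context gathered
    along four scanning routes and merged by averaging."""
    H, W, C = x.shape
    flat = x.reshape(H * W, C)
    merged = np.zeros_like(flat)
    for idx in four_route_indices(H, W):
        scanned = selective_scan_1d(flat[idx])  # process one 1D route
        out = np.empty_like(scanned)
        out[idx] = scanned                      # scatter back to grid order
        merged += out
    return (merged / 4.0).reshape(H, W, C)

if __name__ == "__main__":
    x = np.random.rand(8, 8, 16).astype(np.float32)
    print(ss2d_sketch(x).shape)  # (8, 8, 16)
```

Each route gives every position a different causal context (e.g. left-to-right versus top-to-bottom), which is how SS2D bridges the ordered 1D scan and the non-sequential 2D image structure.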
Related papers
- DAMamba: Vision State Space Model with Dynamic Adaptive Scan [51.81060691414399]
State space models (SSMs) have recently garnered significant attention in computer vision.
We propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions.
Based on DAS, we propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks.
arXiv Detail & Related papers (2025-02-18T08:12:47Z)
- Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks [47.49096400786856]
State-Space Models (SSMs) have recently emerged as a powerful and efficient alternative to the long-standing transformer architecture.
We re-derive modern selective state-space techniques, starting from a multidimensional formulation.
Mamba2D shows comparable performance to prior adaptations of SSMs for vision tasks on standard image classification evaluations with the ImageNet-1K dataset.
arXiv Detail & Related papers (2024-12-20T18:50:36Z)
- Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion [46.82975707531064]
Selective state space models (SSMs) excel at capturing long-range dependencies in 1D sequential data.
We propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space.
We show that Spatial-Mamba, even with a single scan, attains or surpasses the state-of-the-art SSM-based models in image classification, detection and segmentation.
arXiv Detail & Related papers (2024-10-19T12:56:58Z)
- V2M: Visual 2-Dimensional Mamba for Image Representation Learning [68.51380287151927]
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance in processing 1D sequences.
Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence.
We propose a Visual 2-Dimensional Mamba model as a complete solution, which directly processes image tokens in the 2D space.
arXiv Detail & Related papers (2024-10-14T11:11:06Z)
- MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba [0.43512163406552]
State Space Models (SSMs) with Mamba have shown great promise for long-range dependency modeling with linear complexity.
To effectively organize and construct visual features within the 2D image space through 1D selective scan, we propose a novel Multi-Head Scan (MHS) module.
The resulting sub-embeddings, obtained from the multi-head scan process, are then integrated and ultimately projected back into the high-dimensional space.
arXiv Detail & Related papers (2024-06-10T03:24:43Z)
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [48.233300343211205]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim).
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models (see the sketch after this list).
The results demonstrate that Vim is capable of overcoming the computation and memory constraints of performing Transformer-style understanding for high-resolution images.
arXiv Detail & Related papers (2024-01-17T18:56:18Z)
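Below is a minimal sketch of the bidirectional scanning idea attributed to Vim above, under the same simplifying assumptions as the SS2D sketch: `scan_1d` and `bidirectional_block` are hypothetical names, and the running-mean placeholder stands in for a real Mamba block rather than reproducing its API.

```python
# Illustrative sketch of bidirectional 1D scanning over patch tokens with
# position embeddings; not the official Vim implementation.
import numpy as np

def scan_1d(seq):
    """Placeholder causal operator (running mean) standing in for a Mamba block."""
    cumsum = np.cumsum(seq, axis=0)
    steps = np.arange(1, seq.shape[0] + 1)[:, None]
    return cumsum / steps

def bidirectional_block(tokens, pos_embed):
    """tokens, pos_embed: (N, C). Forward and backward scans over the same
    sequence are averaged, so every token sees context from both directions."""
    x = tokens + pos_embed
    forward = scan_1d(x)
    backward = scan_1d(x[::-1])[::-1]   # scan the reversed sequence, flip back
    return (forward + backward) / 2.0

if __name__ == "__main__":
    N, C = 196, 64                       # e.g. 14x14 patches, 64 channels
    tokens = np.random.rand(N, C).astype(np.float32)
    pos = np.random.rand(N, C).astype(np.float32)
    print(bidirectional_block(tokens, pos).shape)  # (196, 64)
```

Compared with the four-route SS2D sketch, this uses only two routes (the flattened sequence and its reverse), which is the 1D bidirectional variant.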
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.