LocalMamba: Visual State Space Model with Windowed Selective Scan
- URL: http://arxiv.org/abs/2403.09338v1
- Date: Thu, 14 Mar 2024 12:32:40 GMT
- Title: LocalMamba: Visual State Space Model with Windowed Selective Scan
- Authors: Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu
- Abstract summary: The key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling.
We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies.
Our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs.
- Score: 45.00004931200446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
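The windowed scan described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (their code is at the repository linked above). It only shows, under simplifying assumptions, how visiting tokens window by window shortens the sequence distance between vertically adjacent tokens compared with a plain row-major flatten. The function names `local_scan_order` and `sequence_gap` are hypothetical, the grid is assumed to divide evenly into square windows, and the global scan directions and per-layer search from the paper are omitted.

```python
import numpy as np


def local_scan_order(height, width, window):
    """Return flat token indices in window-by-window scan order.

    Tokens are numbered row-major (index = row * width + col). Windows are
    visited in row-major order, and tokens inside each window are also
    visited in row-major order. Assumes height and width are divisible by
    window (an illustrative simplification).
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    idx = np.arange(height * width).reshape(height, width)
    # Axes after reshape: (window_row, row_in_window, window_col, col_in_window)
    blocks = idx.reshape(height // window, window, width // window, window)
    # Reorder to (window_row, window_col, row_in_window, col_in_window)
    return blocks.transpose(0, 2, 1, 3).reshape(-1)


def sequence_gap(order, a, b):
    """Number of scan steps separating tokens a and b under a given order."""
    position = np.argsort(order)  # position[t] = index of token t in the scan
    return int(abs(position[a] - position[b]))


row_major = np.arange(16)             # plain flatten of a 4x4 token grid
local = local_scan_order(4, 4, 2)     # 2x2 windows

print(local.tolist())
print(sequence_gap(row_major, 0, 4))  # token 4 sits directly below token 0
print(sequence_gap(local, 0, 4))
```

On a 4x4 grid with 2x2 windows, tokens 0 and 4 are vertical neighbours: they are 4 steps apart in the row-major sequence but only 2 steps apart in the windowed order. In general the gap for such neighbours inside a window drops from the image width to the window width.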
Related papers
- V2M: Visual 2-Dimensional Mamba for Image Representation Learning [68.51380287151927]
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences.
Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence.
We propose a Visual 2-Dimensional Mamba model as a complete solution, which directly processes image tokens in the 2D space.
arXiv Detail & Related papers (2024-10-14T11:11:06Z)
- QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model [16.01259690063522]
A new vision Mamba model, coined QuadMamba, captures local dependencies of varying granularities via quadtree-based image partition and scan.
QuadMamba achieves state-of-the-art performance in various vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2024-10-09T12:03:50Z)
- MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba [0.43512163406552]
State Space Models (SSMs) with Mamba have shown great promise for long-range dependency modeling with linear complexity.
To effectively organize and construct visual features within the 2D image space through 1D selective scan, we propose a novel Multi-Head Scan (MHS) module.
The resulting sub-embeddings, obtained from the multi-head scan process, are then integrated and ultimately projected back into the high-dimensional space.
arXiv Detail & Related papers (2024-06-10T03:24:43Z)
- Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain [9.458951424465605]
State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences.
We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains.
arXiv Detail & Related papers (2024-05-29T01:01:19Z)
- Efficient Visual State Space Model for Image Deblurring [83.57239834238035]
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition [21.761988930589727]
PlainMamba is a simple non-hierarchical state space model (SSM) designed for general visual recognition.
We adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images.
Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks.
arXiv Detail & Related papers (2024-03-26T13:35:10Z)
- The Hidden Attention of Mamba Models [54.50526986788175]
The Mamba layer offers an efficient selective state space model (SSM) that is highly effective in modeling multiple domains.
We show that such models can be viewed as attention-driven models.
This new perspective enables us to empirically and theoretically compare the underlying mechanisms to that of the self-attention layers in transformers.
arXiv Detail & Related papers (2024-03-03T18:58:21Z)
- VMamba: Visual State Space Model [92.83984290020891]
VMamba is a vision backbone that works in linear time complexity.
At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module.
arXiv Detail & Related papers (2024-01-18T17:55:39Z)
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [48.233300343211205]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim).
Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models.
The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images.
arXiv Detail & Related papers (2024-01-17T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.