Related papers: V2M: Visual 2-Dimensional Mamba for Image Representation Learning

URL: http://arxiv.org/abs/2410.10382v1
Date: Mon, 14 Oct 2024 11:11:06 GMT
Title: V2M: Visual 2-Dimensional Mamba for Image Representation Learning
Authors: Chengkun Wang, Wenzhao Zheng, Yuanhui Huang, Jie Zhou, Jiwen Lu
Abstract summary: Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences. Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. We propose a Visual 2-Dimensional Mamba model as a complete solution, which directly processes image tokens in the 2D space.
Score: 68.51380287151927
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. To compensate for the 2D structure information loss (e.g., local similarity) of the original image, most existing methods focus on designing different orders to sequentially process the tokens, which could only alleviate this issue to some extent. In this paper, we propose a Visual 2-Dimensional Mamba (V2M) model as a complete solution, which directly processes image tokens in the 2D space. We first generalize SSM to the 2-dimensional space which generates the next state considering two adjacent states on both dimensions (e.g., columns and rows). We then construct our V2M based on the 2-dimensional SSM formulation and incorporate Mamba to achieve hardware-efficient parallel processing. The proposed V2M effectively incorporates the 2D locality prior yet inherits the efficiency and input-dependent scalability of Mamba. Extensive experimental results on ImageNet classification and downstream visual tasks including object detection and instance segmentation on COCO and semantic segmentation on ADE20K demonstrate the effectiveness of our V2M compared with other visual backbones.
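
As a rough illustration of the 2-dimensional recurrence the abstract describes (each state depends on the two adjacent states along rows and columns), the following is a minimal sketch, not the authors' implementation. The abstract gives no equations, so the linear form, the function name `ssm_2d_scan`, and the parameters `A_up`, `A_left`, `B`, `C` are all assumptions made for illustration. The sketch uses fixed (input-independent) matrices and a naive double loop, whereas V2M is described as input-dependent and hardware-efficient.

```python
import numpy as np

def ssm_2d_scan(x, A_up, A_left, B, C):
    """Naive 2D linear state recurrence over an H x W grid of scalar inputs.

    Assumed (hypothetical) form, not taken from the paper:
        h[i, j] = A_up @ h[i-1, j] + A_left @ h[i, j-1] + B * x[i, j]
        y[i, j] = C @ h[i, j]
    States outside the grid are treated as zero.
    """
    H, W = x.shape
    N = B.shape[0]
    # Pad with one zero row and one zero column so border cells need no special case.
    h = np.zeros((H + 1, W + 1, N))
    y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # h[i, j + 1] is the state of the cell above, h[i + 1, j] the cell to the left.
            h[i + 1, j + 1] = A_up @ h[i, j + 1] + A_left @ h[i + 1, j] + B * x[i, j]
            y[i, j] = C @ h[i + 1, j + 1]
    return y

if __name__ == "__main__":
    # Toy example with a 1-dimensional state and constant input.
    a = np.array([[0.5]])
    out = ssm_2d_scan(np.ones((2, 2)), a, a, np.array([1.0]), np.array([1.0]))
    print(out)
```

Setting `A_up` to zero reduces the sketch to an ordinary 1D scan along each row, which is the flattened-sequence setting that the abstract contrasts V2M with.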

Related papers

DefMamba: Deformable Visual State Space Model [65.50381013020248]
We propose a novel visual foundation model called DefMamba. By combining a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks.
arXiv Detail & Related papers (2025-04-08T08:22:54Z)
Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks [47.49096400786856]
State-Space Models (SSMs) have recently emerged as a powerful and efficient alternative to the long-standing transformer architecture. We re-derive modern selective state-space techniques, starting from a multidimensional formulation. Mamba2D shows comparable performance to prior adaptations of SSMs for vision tasks on standard image classification evaluations with the ImageNet-1K dataset.
arXiv Detail & Related papers (2024-12-20T18:50:36Z)
2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification [40.10133518650528]
We propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba. Experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves by up to 2.48% in AUC, 3.11% in F1 score, 2.47% in accuracy and 5.52% in C-index.
arXiv Detail & Related papers (2024-12-01T05:42:58Z)
Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion [46.82975707531064]
Selective state space models (SSMs) excel at capturing long-range dependencies in 1D sequential data. We propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space. We show that Spatial-Mamba, even with a single scan, attains or surpasses the state-of-the-art SSM-based models in image classification, detection and segmentation.
arXiv Detail & Related papers (2024-10-19T12:56:58Z)
MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation [3.64388407705261]
We propose a Multi-Scale Vision Mamba UNet model for medical image segmentation, termed MSVM-UNet. Specifically, by introducing multi-scale convolutions in the VSS blocks, we can more effectively capture and aggregate multi-scale feature representations from the hierarchical features of the VMamba encoder.
arXiv Detail & Related papers (2024-08-25T06:20:28Z)
MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba [0.43512163406552]
State Space Models (SSMs) with Mamba have shown great promise for long-range dependency modeling with linear complexity. To effectively organize and construct visual features within the 2D image space through 1D selective scan, we propose a novel Multi-Head Scan (MHS) module. The resulting sub-embeddings, obtained from the multi-head scan process, are then integrated and ultimately projected back into the high-dimensional space.
arXiv Detail & Related papers (2024-06-10T03:24:43Z)
MambaVC: Learned Visual Compression with Selective State Spaces [74.29217829932895]
We introduce MambaVC, a simple, strong and efficient compression network based on SSM. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads.
arXiv Detail & Related papers (2024-05-24T10:24:30Z)
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition [21.761988930589727]
PlainMamba is a simple non-hierarchical state space model (SSM) designed for general visual recognition. We adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks.
arXiv Detail & Related papers (2024-03-26T13:35:10Z)
MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection [72.46396769642787]
We develop a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient infrared small target detection. MiM-ISTD is $8\times$ faster than the SOTA method and reduces GPU memory usage by 62.2% when testing on $2048 \times 2048$ images.
arXiv Detail & Related papers (2024-03-04T15:57:29Z)
VMamba: Visual State Space Model [92.83984290020891]
VMamba is a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module.
arXiv Detail & Related papers (2024-01-18T17:55:39Z)
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [48.233300343211205]
We propose a new generic vision backbone with bidirectional Mamba blocks (Vim). Vim marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images.
arXiv Detail & Related papers (2024-01-17T18:56:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.