Related papers: A Survey on Mamba Architecture for Vision Applications

A Survey on Mamba Architecture for Vision Applications

URL: http://arxiv.org/abs/2502.07161v1
Date: Tue, 11 Feb 2025 00:59:30 GMT
Title: A Survey on Mamba Architecture for Vision Applications
Authors: Fady Ibrahim, Guangjun Liu, Guanghui Wang,
Abstract summary: Mamba architecture addresses scalability challenges in visual tasks.<n>Vision Mamba and VideoMamba introduce bidirectional scanning, selective mechanisms, andtemporal processing to enhance image and video understanding.<n>These advancements position Mamba as a promising architecture in computer vision research and applications.
Score: 7.216568558372857
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but their quadratic complexity in attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state-space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates Mamba architecture for visual domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations like position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture in computer vision research and applications.

Related papers

Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook [46.65330450810048]
State Space Models (SSMs) have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing.
arXiv Detail & Related papers (2025-05-01T16:07:51Z)
DefMamba: Deformable Visual State Space Model [65.50381013020248]
We propose a novel visual foundation model called DefMamba. By combining a deformable scanning(DS) strategy, this model significantly improves its ability to learn image structures and detects changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance in various visual tasks.
arXiv Detail & Related papers (2025-04-08T08:22:54Z)
TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba [88.31117598044725]
We explore cross-architecture training to transfer the ready knowledge in existing Transformer models to alternative architecture Mamba, termed TransMamba. Our approach employs a two-stage strategy to expedite training new Mamba models, ensuring effectiveness in across uni-modal and cross-modal tasks. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of Mamba architecture.
arXiv Detail & Related papers (2025-02-21T01:22:01Z)
Mamba-SEUNet: Mamba UNet for Monaural Speech Enhancement [54.427965535613886]
Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and computer vision.<n>In this work, we introduce Mamba-SEUNet, an innovative architecture that integrates Mamba with U-Net for SE tasks.
arXiv Detail & Related papers (2024-12-21T13:43:51Z)
Mamba in Vision: A Comprehensive Survey of Techniques and Applications [3.4580301733198446]
Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity.
arXiv Detail & Related papers (2024-10-04T02:58:49Z)
Distillation-free Scaling of Large SSMs for Images and Videos [27.604572990625144]
State-space models (SSMs) have introduced a novel context modeling method by integrating state-space techniques into deep learning. Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. We propose a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance.
arXiv Detail & Related papers (2024-09-18T10:48:10Z)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
A Survey on Visual Mamba [16.873917203618365]
State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Since the self-attention mechanism in transformers has quadratic complexity with image size and increasing computational demands, the researchers are now exploring how to adapt Mamba for computer vision tasks. This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models in the field of computer vision.
arXiv Detail & Related papers (2024-04-24T16:23:34Z)
PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition [21.761988930589727]
PlainMamba is a simple non-hierarchical state space model (SSM) designed for general visual recognition. We adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks.
arXiv Detail & Related papers (2024-03-26T13:35:10Z)
VMamba: Visual State Space Model [98.0517369083152]
We adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity.<n>At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module.
arXiv Detail & Related papers (2024-01-18T17:55:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.