Mamba-R: Vision Mamba ALSO Needs Registers
- URL: http://arxiv.org/abs/2405.14858v1
- Date: Thu, 23 May 2024 17:58:43 GMT
- Title: Mamba-R: Vision Mamba ALSO Needs Registers
- Authors: Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie
- Abstract summary: This paper identifies artifacts in the feature maps of Vision Mamba, similar to those previously observed in Vision Transformers.
These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba.
To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba.
- Score: 45.41648622999754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba -- they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture Mamba-R. Qualitative observations suggest, compared to vanilla Vision Mamba, Mamba-R's feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, Mamba-R attains stronger performance and scales better. For example, on the ImageNet benchmark, our base-size Mamba-R attains 82.9% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size (i.e., with 341M parameters), attaining a competitive accuracy of 83.2% (84.5% if finetuned with 384x384 inputs). Additional validation on the downstream semantic segmentation task also supports Mamba-R's efficacy.
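As a reading aid, here is a minimal sketch (not the authors' code) of the two modifications described in the abstract: evenly interleaving register tokens into the patch sequence and recycling their outputs for the final prediction. A placeholder `nn.Identity` stands in for the actual uni-directional Mamba blocks, and the class name, register count, and mean pooling of the recycled registers are illustrative assumptions.

```python
# Hedged sketch of the two Mamba-R modifications: (1) evenly insert register
# tokens into the patch sequence, (2) recycle the registers for prediction.
# The Mamba mixer is replaced by nn.Identity; all names/shapes are assumptions.
import torch
import torch.nn as nn


class RegisterVisionBackbone(nn.Module):
    def __init__(self, dim=192, num_patches=196, num_registers=12, num_classes=1000):
        super().__init__()
        self.registers = nn.Parameter(torch.zeros(num_registers, dim))
        self.mixer = nn.Identity()               # stand-in for stacked Mamba blocks
        self.head = nn.Linear(dim, num_classes)
        step = num_patches // num_registers
        self.insert_positions = [i * step for i in range(num_registers)]

    def forward(self, patch_tokens):             # (B, N, D) patch embeddings
        B, N, D = patch_tokens.shape
        chunks, prev = [], 0
        # 1) Evenly insert one register token before each chunk of patch tokens.
        for r, pos in enumerate(self.insert_positions):
            chunks.append(patch_tokens[:, prev:pos])
            chunks.append(self.registers[r].expand(B, 1, D))
            prev = pos
        chunks.append(patch_tokens[:, prev:])
        x = self.mixer(torch.cat(chunks, dim=1))  # (B, N + R, D)
        # 2) Recycle the registers: gather their outputs and pool them for the head.
        reg_idx = torch.tensor(
            [pos + r for r, pos in enumerate(self.insert_positions)],
            device=x.device,
        )
        reg_out = x[:, reg_idx]                   # (B, R, D)
        return self.head(reg_out.mean(dim=1))     # (B, num_classes)


# Usage: logits = RegisterVisionBackbone()(torch.randn(2, 196, 192))
```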
Related papers
- Mamba meets crack segmentation [0.18416014644193066]
Cracks pose safety risks to infrastructure and cannot be overlooked.
CNNs exhibit a deficiency in global modeling capability, hindering the representation of entire crack features.
This study explores the representation capabilities of Mamba for crack features.
arXiv Detail & Related papers (2024-07-22T15:21:35Z) - MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.
Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features.
We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z) - Autoregressive Pretraining with Mamba in Vision [45.25546594814871]
This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining.
Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy.
Our huge-size Mamba attains 85.0% ImageNet accuracy when finetuned with $384\times384$ inputs.
arXiv Detail & Related papers (2024-06-11T17:58:34Z) - Demystify Mamba in Vision: A Linear Attention Perspective [72.93213667713493]
Mamba is an effective state space model with linear computation complexity.
We show that Mamba shares surprising similarities with the linear attention Transformer (a rough sketch of this recurrent linear-attention view follows the related-papers list below).
We propose a Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention.
arXiv Detail & Related papers (2024-05-26T15:31:09Z) - MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism.
This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics.
We construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z) - CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation [18.383760896304604]
This report introduces the first attempt to train a Mamba model utilizing contrastive language-image pretraining (CLIP).
A Mamba model with 67 million parameters is on par with a 307-million-parameter Vision Transformer (ViT) model in zero-shot classification tasks.
arXiv Detail & Related papers (2024-04-30T09:40:07Z) - Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining [85.08169822181685]
This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks.
Swin-UMamba demonstrates superior performance with a large margin compared to CNNs, ViTs, and latest Mamba-based models.
arXiv Detail & Related papers (2024-02-05T18:58:11Z)
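To make the linear-attention connection mentioned in the "Demystify Mamba in Vision" entry concrete, the following hedged sketch computes linear attention as a left-to-right recurrence over a running outer-product state, which is the sense in which it resembles an SSM/Mamba-style recurrent update. The elu+1 feature map and all shapes are assumptions, not details taken from that paper.

```python
# Hedged sketch: causal linear attention written as a recurrence, to illustrate
# its resemblance to an SSM-style state update. Not the MLLA implementation.
import torch
import torch.nn.functional as F


def linear_attention_recurrent(q, k, v):
    """q, k, v: (B, N, D). Returns (B, N, D) causal linear-attention outputs."""
    phi = lambda x: F.elu(x) + 1.0        # positive feature map (an assumption)
    q, k = phi(q), phi(k)
    B, N, D = q.shape
    S = q.new_zeros(B, D, D)              # running sum of phi(k_t) v_t^T
    z = q.new_zeros(B, D)                 # running sum of phi(k_t), for normalization
    out = []
    for t in range(N):
        S = S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)   # rank-1 state update
        z = z + k[:, t]
        num = torch.einsum('bd,bde->be', q[:, t], S)
        den = (q[:, t] * z).sum(-1, keepdim=True).clamp_min(1e-6)
        out.append(num / den)
    return torch.stack(out, dim=1)        # (B, N, D)


# Usage: y = linear_attention_recurrent(*[torch.randn(2, 8, 16) for _ in range(3)])
```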
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.