Related papers: Autoregressive Pretraining with Mamba in Vision

URL: http://arxiv.org/abs/2406.07537v1
Date: Tue, 11 Jun 2024 17:58:34 GMT
Title: Autoregressive Pretraining with Mamba in Vision
Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
Abstract summary: This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy than its supervised-trained counterparts. Our huge-size Mamba attains 85.0% ImageNet accuracy (85.5% when finetuned with $384\times384$ inputs).
Score: 45.25546594814871
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature can well capitalize on the Mamba's unidirectional recurrent structure, enabling faster overall training speed compared to other training strategies like mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with $384\times384$ inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.

Related papers

Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video [57.805927523341516]
Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion. We will publicly release the source code of Mamba-OTR to support future research.
arXiv Detail & Related papers (2025-07-22T08:23:51Z)
Can Mamba Always Enjoy the "Free Lunch"? [9.024844892536327]
Transformers have been the cornerstone of current Large Language Models (LLMs). Mamba has gradually attracted attention due to its constant-level size during inference. Our results suggest that to solve arbitrary DP problems, the total cost of Mamba is comparable to standard and efficient Transformers.
arXiv Detail & Related papers (2024-10-04T13:31:24Z)
MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining [23.37555991996508]
We propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. We show that both the pure Mamba architecture and the hybrid Mamba-Transformer vision backbone network pretrained with MAP significantly outperform other pretraining strategies.
arXiv Detail & Related papers (2024-10-01T17:05:08Z)
MambaMIM: Pre-training Mamba with State Space Token-interpolation [14.343466340528687]
We introduce a generative self-supervised learning method for Mamba (MambaMIM) based on Selective Structure State Space Sequence Token-interpolation (S6T). MambaMIM can be used on any single or hybrid Mamba architectures to enhance the Mamba long-range representation capability.
arXiv Detail & Related papers (2024-08-15T10:35:26Z)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. We conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba.
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
Snakes and Ladders: Two Steps Up for VideoMamba [10.954210339694841]
In this paper, we theoretically analyze the differences between self-attention and Mamba. We propose VideoMambaPro models that surpass VideoMamba by 1.6-2.8% and 1.1-1.9% top-1. Our two solutions are orthogonal to recent advances in Vision Mamba models, and are likely to provide further improvements in future models.
arXiv Detail & Related papers (2024-06-27T08:45:31Z)
Demystify Mamba in Vision: A Linear Attention Perspective [72.93213667713493]
Mamba is an effective state space model with linear computation complexity. We show that Mamba shares surprising similarities with linear attention Transformer. We propose a Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention.
arXiv Detail & Related papers (2024-05-26T15:31:09Z)
Mamba-R: Vision Mamba ALSO Needs Registers [45.41648622999754]
Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba.
arXiv Detail & Related papers (2024-05-23T17:58:43Z)
MambaOut: Do We Really Need Mamba for Vision? [70.60495392198686]
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism. This paper conceptually concludes that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. We construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM.
arXiv Detail & Related papers (2024-05-13T17:59:56Z)
CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation [18.383760896304604]
This report introduces the first attempt to train a Mamba model utilizing contrastive language-image pretraining (CLIP). A Mamba model with 67 million parameters is on par with a 307-million-parameter Vision Transformer (ViT) model in zero-shot classification tasks.
arXiv Detail & Related papers (2024-04-30T09:40:07Z)
Visual Mamba: A Survey and New Outlooks [33.90213491829634]
Mamba, a recent selective structured state space model, excels in long sequence modeling. Since January 2024, Mamba has been actively applied to diverse computer vision tasks. This paper reviews visual Mamba approaches, analyzing over 200 papers.
arXiv Detail & Related papers (2024-04-29T16:51:30Z)
Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining [85.08169822181685]
This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks. Swin-UMamba outperforms CNNs, ViTs, and the latest Mamba-based models by a large margin.
arXiv Detail & Related papers (2024-02-05T18:58:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.