Separators in Enhancing Autoregressive Pretraining for Vision Mamba
- URL: http://arxiv.org/abs/2603.03806v1
- Date: Wed, 04 Mar 2026 07:39:42 GMT
- Title: Separators in Enhancing Autoregressive Pretraining for Vision Mamba
- Authors: Hanpeng Liu, Zidan Wang, Shuoxi Zhang, Kaiyuan Gao, Kun He
- Abstract summary: We introduce an innovative autoregressive pretraining method for Vision Mamba. It introduces new \textbf{S}epara\textbf{T}ors for \textbf{A}uto\textbf{R}egressive pretraining (\textbf{STAR}) to demarcate and differentiate between different images.
- Score: 14.94233154248831
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The state space model Mamba has recently emerged as a promising paradigm in computer vision, attracting significant attention due to its efficient processing of long sequence tasks. Mamba's inherent causal mechanism renders it particularly suitable for autoregressive pretraining. However, current autoregressive pretraining methods are constrained to short sequence tasks, failing to fully exploit Mamba's prowess in handling extended sequences. To address this limitation, we introduce an innovative autoregressive pretraining method for Vision Mamba that substantially extends the input sequence length. We introduce new \textbf{S}epara\textbf{T}ors for \textbf{A}uto\textbf{R}egressive pretraining to demarcate and differentiate between different images, known as \textbf{STAR}. Specifically, we insert identical separators before each image to demarcate its inception. This strategy enables us to quadruple the input sequence length of Vision Mamba while preserving the original dimensions of the dataset images. Employing this long sequence pretraining technique, our STAR-B model achieved an impressive accuracy of 83.5\% on ImageNet-1k, which is highly competitive in Vision Mamba. These results underscore the potential of our method in enhancing the performance of vision models through improved leveraging of long-range dependencies.
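As a concrete illustration of the separator idea, here is a minimal PyTorch sketch (not the authors' code; the module name, embedding size, and the grouping of four images per sequence are assumptions based on the abstract) of how one shared separator token could be prepended to each image's patch tokens before concatenating several images into a single long autoregressive sequence.

```python
import torch
import torch.nn as nn

class SeparatedSequenceBuilder(nn.Module):
    """Sketch: prepend a shared separator token to each image's patch
    tokens, then concatenate several images into one long sequence."""

    def __init__(self, embed_dim: int = 192, images_per_seq: int = 4):
        super().__init__()
        # One learnable separator embedding, identical for every image.
        self.sep_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.images_per_seq = images_per_seq

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (num_images, tokens_per_image, embed_dim)
        n, t, d = patch_tokens.shape
        assert n % self.images_per_seq == 0
        sep = self.sep_token.expand(n, 1, d)
        # Mark each image's inception with the separator.
        with_sep = torch.cat([sep, patch_tokens], dim=1)
        # Group every `images_per_seq` images into one long sequence,
        # quadrupling the autoregressive context when images_per_seq=4.
        return with_sep.reshape(n // self.images_per_seq,
                                self.images_per_seq * (t + 1), d)

builder = SeparatedSequenceBuilder()
long_seq = builder(torch.randn(8, 196, 192))  # -> (2, 788, 192)
```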
Related papers
- MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing [14.888533532729864]
MambaEye is a novel, causal sequential encoder that leverages the low complexity and causal processing of a pure Mamba2 backbone. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models. MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$, on the ImageNet-1K classification task.
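A minimal sketch of the strictly unidirectional idea, with a simple gated linear recurrence standing in for the Mamba2 block (the class and hyperparameters below are illustrative assumptions, not MambaEye's implementation): patch embedding makes the encoder size-agnostic, and only the final state feeds the classifier.

```python
import torch
import torch.nn as nn

class CausalPatchEncoder(nn.Module):
    """Sketch of a size-agnostic, strictly unidirectional encoder.
    A gated linear recurrence stands in for the real Mamba2 block."""

    def __init__(self, patch: int = 16, dim: int = 256, classes: int = 1000):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.gate = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, classes)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # Any resolution works: sequence length just grows with the image.
        x = self.embed(img).flatten(2).transpose(1, 2)  # (B, L, dim)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        for t in range(x.size(1)):      # left-to-right only, no future tokens
            g = torch.sigmoid(self.gate(x[:, t]))
            state = g * state + (1 - g) * torch.tanh(self.update(x[:, t]))
        return self.head(state)         # classify from the final state

enc = CausalPatchEncoder()
logits = enc(torch.randn(1, 3, 224, 224))  # also runs at 512**2, 1536**2, ...
```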
arXiv Detail & Related papers (2025-11-25T06:18:18Z)
- Training-free Token Reduction for Vision Mamba [21.451182941570394]
Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs). Applying token reduction techniques designed for ViTs to Vision Mamba leads to significant performance degradation. We propose MTR, a training-free \textbf{M}amba \textbf{T}oken \textbf{R}eduction framework.
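A hedged sketch of training-free token reduction in general (the L2-norm score below is a placeholder, not MTR's actual importance criterion, which is derived from Mamba itself): score tokens, keep the top-k, and preserve their original order so the causal scan is undisturbed.

```python
import torch

def reduce_tokens(tokens: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """Training-free token reduction sketch: score each token, keep the
    top-k, and retain the original (causal) ordering of survivors.

    tokens: (batch, seq_len, dim)."""
    b, l, d = tokens.shape
    k = max(1, int(l * keep_ratio))
    scores = tokens.norm(dim=-1)                            # (b, l) placeholder score
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # keep sequence order
    return tokens.gather(1, idx.unsqueeze(-1).expand(b, k, d))

pruned = reduce_tokens(torch.randn(2, 196, 192))            # -> (2, 137, 192)
```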
arXiv Detail & Related papers (2025-07-18T16:11:28Z)
- DefMamba: Deformable Visual State Space Model [65.50381013020248]
We propose a novel visual foundation model called DefMamba. By incorporating a deformable scanning (DS) strategy, the model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments have shown that DefMamba achieves state-of-the-art performance on various visual tasks.
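A simplified, hypothetical sketch of a deformable scan: predict a data-dependent position per token and scan in that order. Note that a hard sort is not differentiable with respect to the predicted positions; DefMamba's actual DS strategy is more involved.

```python
import torch
import torch.nn as nn

class DeformableScan(nn.Module):
    """Sketch of a deformable scan: instead of a fixed raster order,
    predict a scalar 'position' per token and scan tokens sorted by it.
    A real implementation would use differentiable offsets; this is
    only a simplified stand-in for DefMamba's DS strategy."""

    def __init__(self, dim: int = 192):
        super().__init__()
        self.pos_head = nn.Linear(dim, 1)  # data-dependent scan position

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) in raster order
        pos = self.pos_head(tokens).squeeze(-1)   # (batch, seq_len)
        order = pos.argsort(dim=1)                # learned scan path
        b, l, d = tokens.shape
        return tokens.gather(1, order.unsqueeze(-1).expand(b, l, d))

scan = DeformableScan()
reordered = scan(torch.randn(2, 196, 192))  # feed this to the Mamba blocks
```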
arXiv Detail & Related papers (2025-04-08T08:22:54Z)
- Dynamic Vision Mamba [41.84910346271891]
Mamba-based vision models have gained extensive attention as a result of being computationally more efficient than attention-based models. For token redundancy, we analytically find that early token pruning methods will result in inconsistency between training and inference. For block redundancy, we allow each image to select SSM blocks dynamically, based on an empirical observation that the inference speed of Mamba-based vision models is largely affected by the number of SSM blocks.
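A hypothetical sketch of per-image dynamic block selection (the router, threshold, and stand-in blocks below are assumptions, not the paper's design): a tiny gate decides which blocks run for a given image, so running fewer SSM blocks directly speeds up inference.

```python
import torch
import torch.nn as nn

class DynamicBlockStack(nn.Module):
    """Sketch: let each image choose which blocks to run. A tiny router
    scores every block from the mean token; low-scoring blocks are
    skipped at inference, trading accuracy for speed."""

    def __init__(self, dim: int = 192, depth: int = 12, threshold: float = 0.5):
        super().__init__()
        # nn.Linear layers stand in for Mamba/SSM blocks in this sketch.
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.router = nn.Linear(dim, depth)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); assume batch == 1 for a per-image choice
        gates = torch.sigmoid(self.router(x.mean(dim=1)))  # (batch, depth)
        for i, block in enumerate(self.blocks):
            if gates[0, i] > self.threshold:    # run only the selected blocks
                x = x + torch.tanh(block(x))    # residual update
        return x

out = DynamicBlockStack()(torch.randn(1, 196, 192))
```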
arXiv Detail & Related papers (2025-04-07T07:31:28Z)
- Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning [54.19222454702032]
Continual Learning aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. State Space Models (SSMs) have achieved notable success in computer vision. We introduce Mamba-CL, a framework that continually fine-tunes the core SSMs of the large-scale Mamba foundation model.
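Null-space-constrained updates can be sketched generically (this follows the common null-space continual-learning recipe and is not necessarily Mamba-CL's exact formulation): build a projector from old-task features and push gradients through it before each optimizer step, so updates barely disturb old-task outputs.

```python
import torch

def null_space_projector(feats: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Sketch of null-space continual learning: a projector onto the
    null space of features collected on previous tasks.

    feats: (num_samples, dim) activations from earlier tasks."""
    cov = feats.T @ feats / feats.size(0)        # uncentered feature covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)
    null_basis = eigvecs[:, eigvals < eps]       # directions old tasks ignore
    return null_basis @ null_basis.T             # (dim, dim) projector

# Usage: project each gradient before the optimizer step.
proj = null_space_projector(torch.randn(32, 64))   # rank 32 < dim 64
grad = torch.randn(64, 64)                         # grad of a (out, in) weight
safe_grad = grad @ proj                            # constrain the input side
```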
arXiv Detail & Related papers (2024-11-23T06:36:16Z)
- MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining [23.37555991996508]
We propose Masked Autoregressive Pretraining (MAP) to pretrain a hybrid Mamba-Transformer vision backbone network. Experimental results show that the hybrid Mamba-Transformer vision backbone network pretrained with MAP significantly outperforms other pretraining strategies.
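A minimal sketch of a masked-autoregressive objective under stated assumptions (the masking scheme and MSE target are illustrative; MAP's actual decoding strategy may differ): corrupt the input, run the causal backbone, and regress next tokens only at masked positions.

```python
import torch
import torch.nn.functional as F

def masked_ar_loss(model, tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Sketch of a masked-autoregressive objective: randomly mask input
    tokens, run the causal backbone on the corrupted sequence, and
    regress each next token, scoring only the masked positions.

    tokens: (batch, seq_len, dim); model maps that shape to itself."""
    b, l, d = tokens.shape
    masked = torch.rand(b, l, device=tokens.device) < mask_ratio
    corrupted = tokens.masked_fill(masked.unsqueeze(-1), 0.0)
    pred = model(corrupted)                        # causal features
    # Predict token t+1 from the prefix up to t, only where t+1 was masked.
    return F.mse_loss(pred[:, :-1][masked[:, 1:]], tokens[:, 1:][masked[:, 1:]])

loss = masked_ar_loss(torch.nn.Identity(), torch.randn(2, 196, 192))
```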
arXiv Detail & Related papers (2024-10-01T17:05:08Z)
- MambaMIM: Pre-training Mamba with State Space Token Interpolation and its Application to Medical Image Segmentation [23.67774523461722]
We propose a general-purpose pre-training framework called MambaMIM. MambaMIM learns causal relationships of state space within a masked sequence. We pre-train MambaMIM on a large-scale dataset of 6.8K CT scans.
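MambaMIM's distinctive state-space token interpolation is not reproduced here; the following is only the generic masked-sequence scaffold such methods build on, with a learned mask token and MSE reconstruction (all names and sizes are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSequencePretrainer(nn.Module):
    """Generic masked-sequence scaffold: masked tokens are replaced by a
    learned embedding; the causal encoder reconstructs the originals."""

    def __init__(self, encoder: nn.Module, dim: int = 192):
        super().__init__()
        self.encoder = encoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens: torch.Tensor, mask_ratio: float = 0.6):
        b, l, d = tokens.shape
        masked = torch.rand(b, l, device=tokens.device) < mask_ratio
        filled = torch.where(masked.unsqueeze(-1),
                             self.mask_token.expand(b, l, d), tokens)
        recon = self.encoder(filled)
        # Score reconstruction only at the masked positions.
        return F.mse_loss(recon[masked], tokens[masked])

trainer = MaskedSequencePretrainer(nn.Identity())
loss = trainer(torch.randn(2, 196, 192))
```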
arXiv Detail & Related papers (2024-08-15T10:35:26Z)
- MambaVision: A Hybrid Mamba-Transformer Vision Backbone [54.965143338206644]
We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. We show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput.
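A sketch of the hybrid recipe, with a gated MLP standing in for the real Mamba mixer (depths and head counts below are assumptions): sequence-mixing blocks first, self-attention only in the last few layers.

```python
import torch
import torch.nn as nn

class HybridStage(nn.Module):
    """Sketch of the MambaVision recipe: mostly sequence-mixing blocks,
    with self-attention reserved for the final layers to recover
    long-range spatial dependencies."""

    def __init__(self, dim: int = 256, depth: int = 8, attn_tail: int = 2):
        super().__init__()
        def mixer():  # stand-in for a Mamba block
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.SiLU())
        self.mixers = nn.ModuleList(mixer() for _ in range(depth - attn_tail))
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(attn_tail))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, L, dim)
        for m in self.mixers:
            x = x + m(x)
        for attn in self.attns:          # self-attention only at the end
            y, _ = attn(x, x, x)
            x = x + y
        return x

out = HybridStage()(torch.randn(2, 196, 256))
```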
arXiv Detail & Related papers (2024-07-10T23:02:45Z)
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba [89.07242846058023]
We introduce DeciMamba, a context-extension method specifically designed for Mamba. Experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths significantly longer than the ones seen during training.
arXiv Detail & Related papers (2024-06-20T17:40:18Z)
- Autoregressive Pretraining with Mamba in Vision [45.25546594814871]
This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining.
Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy.
Our huge-size Mamba attains 85.0% ImageNet accuracy when finetuned with $384\times384$ inputs.
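The core autoregressive objective can be sketched in a few lines (a hedged illustration, not the paper's code): each position regresses the raw pixels of the next patch, so the causal backbone never sees the future.

```python
import torch
import torch.nn.functional as F

def next_patch_loss(backbone, patches: torch.Tensor) -> torch.Tensor:
    """Sketch of plain autoregressive pretraining for a causal vision
    backbone: every position predicts the pixels of the next patch.

    patches: (batch, seq_len, patch_dim), e.g. 16*16*3 = 768 per patch."""
    feats = backbone(patches[:, :-1])          # causal: no access to the future
    return F.mse_loss(feats, patches[:, 1:])   # regress the next patch

loss = next_patch_loss(torch.nn.Identity(), torch.randn(2, 196, 768))
```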
arXiv Detail & Related papers (2024-06-11T17:58:34Z)
- Demystify Mamba in Vision: A Linear Attention Perspective [72.93213667713493]
Mamba is an effective state space model with linear computation complexity. We show that Mamba shares surprising similarities with linear attention Transformer. We propose a Mamba-Inspired Linear Attention (MILA) model by incorporating the merits of these two key designs into linear attention.
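The linear-attention view is easy to make concrete. Below is a minimal sketch (the feature map and normalization are chosen for illustration): a running d-by-d state accumulates rank-1 updates, so the cost is linear in sequence length, mirroring an SSM recurrence.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Sketch of the linear-attention view of Mamba: a running state S
    accumulates k v^T outer products, so each step costs O(d^2) and the
    whole sequence is linear in length, like an SSM recurrence.

    q, k, v: (batch, seq_len, dim)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1          # positive feature map
    b, l, d = q.shape
    S = torch.zeros(b, d, d, device=q.device)  # recurrent state
    z = torch.zeros(b, d, device=q.device)     # normalizer state
    out = []
    for t in range(l):
        S = S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)  # rank-1 update
        z = z + k[:, t]
        num = torch.einsum('bd,bde->be', q[:, t], S)
        out.append(num / (q[:, t] * z).sum(-1, keepdim=True).clamp(min=1e-6))
    return torch.stack(out, dim=1)

y = linear_attention(*torch.randn(3, 2, 16, 32))  # -> (2, 16, 32)
```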
arXiv Detail & Related papers (2024-05-26T15:31:09Z)
- Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining [85.08169822181685]
This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks.
Swin-UMamba demonstrates superior performance by a large margin compared to CNNs, ViTs, and the latest Mamba-based models.
arXiv Detail & Related papers (2024-02-05T18:58:11Z)