MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
- URL: http://arxiv.org/abs/2512.00647v2
- Date: Wed, 03 Dec 2025 10:45:29 GMT
- Title: MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
- Authors: Shanhui Liu, Rui Xu, Yunke Wang,
- Abstract summary: We propose MambaScope, an adaptive framework for efficient inference for Vision Mamba. MambaScope first performs coarse-grained inference by dividing the input image into large patches. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover essential visual details.
- Score: 8.769339443165029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss as they discard or compress token representations. This problem is further exacerbated when the same fine-grained token processing is uniformly applied across all images regardless of visual complexity. We observe that not all inputs require fine-grained processing: simple images can be effectively handled at a coarse resolution, while only complex ones require refinement. Based on this insight, we propose MambaScope, an adaptive framework for efficient inference for Vision Mamba. MambaScope first performs coarse-grained inference by dividing the input image into large patches, significantly reducing token length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover essential visual details with minimal additional cost. This dynamic resolution assignment strategy allows MambaScope to allocate computation adaptively according to image complexity, achieving efficient processing without compromising accuracy. Experiments across various vision tasks demonstrate that MambaScope outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
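The confidence-gated control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the patch sizes (32 and 16), the 0.7 confidence threshold, and the variance-based region selection rule are all assumptions made for the example, and `coarse_model` / `fine_model` stand in for any callables that return class logits.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def patchify(image, patch):
    """Split an (H, W, C) array into non-overlapping patch x patch tokens."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)


def top_variance_regions(coarse_tokens, k):
    """Hypothetical selection rule: the k coarse patches with the highest pixel variance."""
    return np.argsort(coarse_tokens.var(axis=1))[::-1][:k]


def coarse_to_fine_predict(image, coarse_model, fine_model,
                           coarse_patch=32, fine_patch=16,
                           threshold=0.7, k=1):
    """Return (label, stage, number of tokens in the final pass)."""
    channels = image.shape[2]

    # Stage 1: coarse pass over large patches (few tokens).
    coarse_tokens = patchify(image, coarse_patch)
    probs = softmax(coarse_model(coarse_tokens))
    if probs.max() >= threshold:
        return int(probs.argmax()), "coarse", len(coarse_tokens)

    # Stage 2: low confidence, so re-tokenize only the selected regions
    # at the finer patch size and keep the remaining coarse tokens as they are.
    selected = top_variance_regions(coarse_tokens, k)
    keep = np.setdiff1d(np.arange(len(coarse_tokens)), selected)
    fine_tokens = np.concatenate([
        patchify(coarse_tokens[i].reshape(coarse_patch, coarse_patch, channels),
                 fine_patch)
        for i in selected
    ])
    probs = softmax(fine_model(coarse_tokens[keep], fine_tokens))
    return int(probs.argmax()), "fine", len(keep) + len(fine_tokens)
```

With a 64x64 input under these assumed settings, the coarse pass sees 4 tokens, and a refinement pass with one selected region sees 3 coarse plus 4 fine tokens, compared with 16 tokens if the whole image were processed at the fine patch size.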
Related papers
- MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing [14.888533532729864]
MambaEye is a novel, causal sequential encoder that leverages the low complexity and causal-process based pure Mamba2 backbone. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models. MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$ on the ImageNet-1K classification task.
arXiv Detail & Related papers (2025-11-25T06:18:18Z) - CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning [15.733788584792388]
We propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal in-context learning (ICL). After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks. It effectively improves efficiency by achieving an average reduction of 10.78% in latency.
arXiv Detail & Related papers (2025-08-11T11:41:51Z) - VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization [70.98122339799218]
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. We propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8
arXiv Detail & Related papers (2025-08-07T09:47:21Z) - Training-free Token Reduction for Vision Mamba [21.451182941570394]
Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs). Applying token reduction techniques for ViTs to Vision Mamba leads to significant performance degradation. We propose MTR, a training-free Mamba Token Reduction framework.
arXiv Detail & Related papers (2025-07-18T16:11:28Z) - VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z) - EAMamba: Efficient All-Around Vision State Space Model for Image Restoration [11.190025966582041]
This study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.
arXiv Detail & Related papers (2025-06-27T14:12:58Z) - ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework. It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z) - Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM [41.796933489107815]
We identify and study the computation-level redundancy on vision tokens to ensure no information loss. We propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens.
arXiv Detail & Related papers (2025-05-21T17:59:52Z) - Dynamic Vision Mamba [41.84910346271891]
Mamba-based vision models have gained extensive attention as a result of being computationally more efficient than attention-based models. For token redundancy, we analytically find that early token pruning methods will result in inconsistency between training and inference. For block redundancy, we allow each image to select SSM blocks dynamically based on an empirical observation that the inference speed of Mamba-based vision models is largely affected by the number of SSM blocks.
arXiv Detail & Related papers (2025-04-07T07:31:28Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
arXiv Detail & Related papers (2023-11-06T11:14:19Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Token Pooling in Vision Transformers [37.11990688046186]
In vision transformers, self-attention is not the major bottleneck, e.g., more than 80% of the computation is spent on fully-connected layers.
We propose a novel token downsampling method, called Token Pooling, efficiently exploiting redundancies in the images and intermediate token representations.
Our experiments show that Token Pooling significantly improves the cost-accuracy trade-off over the state-of-the-art downsampling.
arXiv Detail & Related papers (2021-10-08T02:22:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.