Scaling Vision Mamba Across Resolutions via Fractal Traversal
- URL: http://arxiv.org/abs/2505.14062v1
- Date: Tue, 20 May 2025 08:08:28 GMT
- Title: Scaling Vision Mamba Across Resolutions via Fractal Traversal
- Authors: Bo Li, Haoke Xiao, Lv Tang
- Abstract summary: We propose FractalMamba++, a vision backbone that leverages fractal-based patch serialization via Hilbert curves. To address long-range dependency fading in high-resolution inputs, we introduce a Cross-State Routing (CSR) mechanism that enhances global context propagation. Experiments on image classification, semantic segmentation, object detection, and change detection demonstrate that FractalMamba++ consistently outperforms previous Mamba-based backbones.
- Score: 9.566046692165884
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Mamba has recently emerged as a promising alternative to Transformer-based architectures, offering linear complexity in sequence length while maintaining strong modeling capacity. However, its adaptation to visual inputs is hindered by challenges in 2D-to-1D patch serialization and weak scalability across input resolutions. Existing serialization strategies such as raster scanning disrupt local spatial continuity and limit the model's ability to generalize across scales. In this paper, we propose FractalMamba++, a robust vision backbone that leverages fractal-based patch serialization via Hilbert curves to preserve spatial locality and enable seamless resolution adaptability. To address long-range dependency fading in high-resolution inputs, we further introduce a Cross-State Routing (CSR) mechanism that enhances global context propagation through selective state reuse. Additionally, we propose a Positional-Relation Capture (PRC) module to recover local adjacency disrupted by curve inflection points. Extensive experiments on image classification, semantic segmentation, object detection, and change detection demonstrate that FractalMamba++ consistently outperforms previous Mamba-based backbones, particularly under high-resolution settings.
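The abstract's contrast between raster scanning and Hilbert-curve serialization can be illustrated with a small sketch. This is not the authors' implementation: it uses the textbook index-to-coordinate Hilbert construction and assumes a square patch grid whose side is a power of two, whereas the paper targets arbitrary resolutions. The function names `hilbert_d2xy`, `hilbert_order`, and `max_step` are hypothetical and chosen only for this illustration.

```python
def hilbert_d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    Assumption: n is a power of two (standard textbook construction).
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the current sub-square so the curve stays connected.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Return raster indices (row * n + col) of all patches in Hilbert order."""
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    order = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)
        order.append(y * n + x)
    return order


def max_step(order, n):
    """Largest Manhattan distance between consecutive patches in a traversal."""
    coords = [divmod(i, n) for i in order]  # (row, col) of each patch
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(coords, coords[1:])
    )


if __name__ == "__main__":
    n = 8
    print("Hilbert max step:", max_step(hilbert_order(n), n))
    print("Raster max step: ", max_step(list(range(n * n)), n))
```

On an 8 x 8 grid, every consecutive pair of patches in the Hilbert traversal is spatially adjacent, while the raster traversal jumps a Manhattan distance of 8 at the end of each row. That row-end jump is the loss of local spatial continuity the abstract attributes to raster scanning. A token sequence would then be reordered by indexing the flattened patch array with `hilbert_order(n)`.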
Related papers
- Rotation Equivariant Arbitrary-scale Image Super-Resolution [62.41329042683779]
Arbitrary-scale image super-resolution (ASISR) aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. We make efforts to construct a rotation-equivariant ASISR method in this study.
arXiv Detail & Related papers (2025-08-07T08:51:03Z) - MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture [12.168520751389622]
Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy. This paper proposes a novel MVNet network architecture that integrates 3D-CNN's local feature extraction, Transformer's global modeling, and Mamba's linear sequence modeling capabilities. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency.
arXiv Detail & Related papers (2025-07-06T14:52:26Z) - RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement [59.364418120895]
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications. We develop a novel relation-driven Mamba framework for effective UIE (RD-UIE). Experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba.
arXiv Detail & Related papers (2025-05-02T12:21:44Z) - RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task [20.16344973940904]
High-resolution remote sensing analysis faces challenges due to scene complexity and scale diversity. We propose RSRWKV, featuring a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning.
arXiv Detail & Related papers (2025-03-26T10:03:46Z) - 2DMCG: 2DMamba with Change Flow Guidance for Change Detection in Remote Sensing [4.18306618346671]
This paper proposes an efficient framework based on a Vision Mamba variant that enhances its ability to capture 2D spatial information. The framework employs a 2DMamba encoder to effectively learn global contextual spatial information from multi-temporal images. Experiments on benchmark datasets demonstrate the superior performance of our framework compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-03-01T14:55:13Z) - PseudoNeg-MAE: Self-Supervised Point Cloud Learning using Conditional Pseudo-Negative Embeddings [55.55445978692678]
PseudoNeg-MAE enhances global feature representation of point cloud masked autoencoders by making them both discriminative and sensitive to transformations. We propose a novel loss that explicitly penalizes invariant collapse, enabling the network to capture richer transformation cues while preserving discriminative representations.
arXiv Detail & Related papers (2024-09-24T07:57:21Z) - Scalable Visual State Space Model with Fractal Scanning [16.077348474371547]
State Space Models (SSMs) have emerged as efficient alternatives to Transformer models.
We propose using fractal scanning curves for patch serialization.
We validate our method in image classification, detection, and segmentation tasks.
arXiv Detail & Related papers (2024-05-23T12:12:11Z) - MambaIR: A Simple Baseline for Image Restoration with State-Space Model [46.827053426281715]
We introduce MambaIR, which adds both local enhancement and channel attention to improve the vanilla Mamba.
Our method outperforms SwinIR by up to 0.45dB on image SR, using similar computational cost but with a global receptive field.
arXiv Detail & Related papers (2024-02-23T23:15:54Z) - Dual-scale Enhanced and Cross-generative Consistency Learning for Semi-supervised Medical Image Segmentation [49.57907601086494]
Medical image segmentation plays a crucial role in computer-aided diagnosis.
We propose a novel Dual-scale Enhanced and Cross-generative consistency learning framework for semi-supervised medical image segmentation (DEC-Seg).
arXiv Detail & Related papers (2023-12-26T12:56:31Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z) - Improving the generalization of network based relative pose regression: dimension reduction as a regularizer [16.63174637692875]
State-of-the-art visual localization methods perform pose estimation using a geometry-based solver within the RANSAC framework.
End-to-end learning-based regression networks provide a solution to circumvent the requirement for precise pixel-level correspondences.
In this paper, we explicitly add a learnable matching layer within the network to isolate the pose regression solver from the absolute image feature values.
We implement this dimension regularization strategy within a two-layer pyramid-based framework to regress the localization results from coarse to fine.
arXiv Detail & Related papers (2020-10-24T06:20:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.