Related papers: OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection

OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection

URL: http://arxiv.org/abs/2602.00904v1
Date: Sat, 31 Jan 2026 21:12:59 GMT
Title: OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection
Authors: Kunal Mahatha, Ali Bahri, Pierre Marza, Sahar Dastani, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz, Christian Desrosiers,
Abstract summary: We introduce OCTOPUS, a novel architecture that preserves both global context and local spatial structure within images.<n>OCTOPUS performs discrete reoccurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions.<n>In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency.
Score: 20.717476762904038
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State space models (SSMs) have recently emerged as an alternative to transformers due to their unique ability of modeling global relationships in text with linear complexity. However, their success in vision tasks has been limited due to their causal formulation, which is suitable for sequential text but detrimental in the spatial domain where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS , a novel architecture that preserves both global context and local spatial structure within images, while maintaining the linear complexity of SSMs. OCTOPUS performs discrete reoccurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This design enables multi-directional recurrence, capturing both global context and local spatial structure with SSM-level efficiency. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as evident from the segmentation results, while maintaining relatively better classification accuracy compared to existing V-SSM based models. These results suggest that OCTOPUS appears as a foundation method for multi-directional recurrence as a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.

Related papers

Cross-view geo-localization, Image retrieval, Multiscale geometric modeling, Frequency domain enhancement [1.6686955491488273]
Cross-view geo-localization (CVGL) aims to establish spatial correspondences between images captured from significantly different viewpoints.<n>CVGL remains challenging due to severe geometric asymmetry, texture inconsistency across imaging domains, and the progressive degradation of discriminative local information.<n>This paper proposes the Spatial and Frequency Domain Enhancement Network (SFDE), which leverages complementary representations from spatial and frequency domains.
arXiv Detail & Related papers (2026-03-03T08:25:35Z)
SENDAI: A Hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework [2.070436148502153]
SENDAI is a hierarchical Sparse-measurement framework that reconstructs full spatial states from sparse sensor observations.<n>S SENDAI achieves a maximum SSIM improvement of 185% over traditional baselines and a 36% improvement over recent high-frequency-based methods.
arXiv Detail & Related papers (2026-01-29T12:58:54Z)
RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions.<n>These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning.<n>We propose a reasoning-guided, position-aware post-training framework, dubbed textbfRSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z)
A Multi-scale Fused Graph Neural Network with Inter-view Contrastive Learning for Spatial Transcriptomics Data Clustering [7.214595408714774]
stMFG is proposed, a multi-scale interactive fusion graph network that introduces layer-wise cross-view attention to dynamically integrate spatial and gene features after each convolution.<n>It outperforms state-of-the-art methods, achieving up to 14% ARI improvement on certain slices.
arXiv Detail & Related papers (2025-12-18T05:13:55Z)
Graph Laplacian Transformer with Progressive Sampling for Prostate Cancer Grading [2.9485900021889146]
We propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency.<n>IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring.<n>The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints.
arXiv Detail & Related papers (2025-12-11T16:55:57Z)
SSCM: A Spatial-Semantic Consistent Model for Multi-Contrast MRI Super-Resolution [11.194678655584788]
MC-MRI SR aims to enhance low-resolution (LR) contrasts leveraging high-resolution (HR) references.<n>Main challenge lies in maintaining spatial-semantic consistency.
arXiv Detail & Related papers (2025-09-23T03:24:32Z)
ST-LINK: Spatially-Aware Large Language Models for Spatio-Temporal Forecasting [7.853736939635847]
We introduce ST-LINK, a novel framework that enhances the capability of Large Language Models to capture sequential-temporal dependencies.<n>Its key components are spatially-Enhanced Attention (SE-Attention) and the Memory Retrieval Feed-Forward Network (MRFFN)
arXiv Detail & Related papers (2025-09-17T07:11:45Z)
HierRelTriple: Guiding Indoor Layout Generation with Hierarchical Relationship Triplet Losses [52.70183252341687]
We present a hierarchical triplet-based indoor relationship learning method, coined HierRelTriple, with a focus on spatial relationship learning.<n>We introduce HierRelTriple, a hierarchical relational triplets modeling framework that first partitions functional regions and then automatically extracts three levels of spatial relationships.<n>Experiments on unconditional layout synthesis, floorplan-conditioned layout generation, and scene rearrangement demonstrate that HierRel improves spatial-relation metrics by over 15%.
arXiv Detail & Related papers (2025-03-26T07:31:52Z)
Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
Generalized Spatial Propagation Network (GSPN) is a new attention mechanism for optimized vision tasks that inherently captures 2D spatial structures.<n>GSPN overcomes limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach.<n>GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z)
Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation.<n>We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations.<n>We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
Cross-Scan Mamba with Masked Training for Robust Spectral Imaging [51.557804095896174]
We propose the Cross-Scanning Mamba, named CS-Mamba, that employs a Spatial-Spectral SSM for global-local balanced context encoding.<n>Experiment results show that our CS-Mamba achieves state-of-the-art performance and the masked training method can better reconstruct smooth features to improve the visual quality.
arXiv Detail & Related papers (2024-08-01T15:14:10Z)
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery. We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
Local-Global Temporal Difference Learning for Satellite Video Super-Resolution [53.03380679343968]
We propose to exploit the well-defined temporal difference for efficient and effective temporal compensation.<n>To fully utilize the local and global temporal information within frames, we systematically modeled the short-term and long-term temporal discrepancies.<n> Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2023-04-10T07:04:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.