Learning Spatial Decay for Vision Transformers
- URL: http://arxiv.org/abs/2508.09525v1
- Date: Wed, 13 Aug 2025 06:18:32 GMT
- Title: Learning Spatial Decay for Vision Transformers
- Authors: Yuxin Mao, Zhen Qin, Jinxing Zhou, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai
- Abstract summary: Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases. Existing approaches introduce data-independent spatial decay based on fixed distance metrics. We present the first successful adaptation of data-dependent spatial decay to 2D vision transformers.
- Score: 50.63391799053993
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases, leading to suboptimal performance on spatially structured tasks. Existing approaches introduce data-independent spatial decay based on fixed distance metrics, applying uniform attention weighting regardless of image content and limiting adaptability to diverse visual scenarios. Inspired by recent advances in large language models, where content-aware gating mechanisms (e.g., GLA, HGRN2, FOX) significantly outperform static alternatives, we present the first successful adaptation of data-dependent spatial decay to 2D vision transformers. We introduce the Spatial Decay Transformer (SDT), featuring a novel Context-Aware Gating (CAG) mechanism that generates dynamic, data-dependent decay for patch interactions. Our approach learns to modulate spatial attention based on both content relevance and spatial proximity. We address the fundamental challenge of 1D-to-2D adaptation through a unified spatial-content fusion framework that integrates Manhattan-distance-based spatial priors with learned content representations. Extensive experiments on ImageNet-1K classification and generation tasks demonstrate consistent improvements over strong baselines. Our work establishes data-dependent spatial decay as a new paradigm for enhancing spatial attention in vision transformers.
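To make the gating idea concrete, below is a minimal sketch of how a content-aware gate could fuse a Manhattan-distance prior with learned content: each patch predicts a non-negative decay rate that scales a fixed distance matrix before the attention softmax. The module name, gate parameterization, and fusion rule are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def manhattan_distances(h: int, w: int) -> torch.Tensor:
    """Pairwise Manhattan distances between all patches on an h x w grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    return (coords[:, None, :] - coords[None, :, :]).abs().sum(-1)      # (N, N)


class ContextAwareGatingSketch(nn.Module):
    """Illustrative data-dependent spatial decay: each query patch predicts a
    non-negative decay rate from its content, which scales a fixed
    Manhattan-distance prior before the attention softmax."""

    def __init__(self, dim: int, h: int, w: int):
        super().__init__()
        self.to_rate = nn.Linear(dim, 1)  # content -> per-patch decay rate
        self.register_buffer("dist", manhattan_distances(h, w))  # (N, N)

    def forward(self, attn_logits: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) patch features; attn_logits: (B, heads, N, N)
        rate = F.softplus(self.to_rate(x))      # (B, N, 1), >= 0
        decay = -rate.unsqueeze(1) * self.dist  # broadcasts to (B, 1, N, N)
        return F.softmax(attn_logits + decay, dim=-1)
```

Unlike a fixed, data-independent distance bias, the per-query rate here lets the effective receptive field widen or narrow with image content, which is the behavior the abstract describes.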
Related papers
- Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective [47.87649021414188]
We present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency.
arXiv Detail & Related papers (2025-07-02T12:27:06Z)
- Transformers Meet Hyperspectral Imaging: A Comprehensive Study of Models, Challenges and Open Problems [0.0]
We review more than 300 papers published up to 2025 and present the first end-to-end survey dedicated to Transformer-based HSI classification. The study categorizes every stage of a typical pipeline: pre-processing, patch or pixel tokenization, positional encoding, spatial-spectral feature extraction, multi-head self-attention variants, skip connections, and loss design. We outline a research agenda prioritizing valuable public data sets, lightweight on-edge models, illumination and sensor shifts, and intrinsically interpretable attention mechanisms.
arXiv Detail & Related papers (2025-06-10T09:04:30Z)
- SEM: Enhancing Spatial Understanding for Robust Robot Manipulation [13.620151960111764]
SEM (Spatial Enhanced Manipulation model) is a novel diffusion-based policy framework that enhances spatial understanding from two complementary perspectives: a spatial enhancer augments visual representations with 3D geometric context, while a robot state encoder captures embodiment-aware structure through graph-based modeling of joint dependencies.
arXiv Detail & Related papers (2025-05-22T04:00:12Z)
- Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments [10.953629652228024]
Vision-and-Language Navigation (VLN) agents associate time-sequenced visual observations with corresponding instructions to make decisions. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view. We propose a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue.
arXiv Detail & Related papers (2025-02-26T10:30:40Z)
- EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation [55.26713167507132]
We introduce a generative robotics foundation model that constructs and interprets embodied spaces. EnerVerse employs an autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning. We present EnerVerse-D, a data engine pipeline combining the generative model with 4D Gaussian Splatting, forming a self-reinforcing data loop to reduce the sim-to-real gap.
arXiv Detail & Related papers (2025-01-03T17:00:33Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations (a toy multi-scale convolution sketch follows this entry). We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
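The DVSS block's internals are not spelled out in this summary; as a rough illustration of augmenting state-space features with multi-scale convolutional operations, the sketch below (all names and kernel choices are assumptions) adds parallel depthwise convolutions at several kernel sizes to a feature map:

```python
import torch
import torch.nn as nn


class MultiScaleConvAugment(nn.Module):
    """Illustrative multi-scale branch: parallel depthwise convolutions at
    several kernel sizes, summed residually back into the input features."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
             for k in kernel_sizes]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map, e.g. the output of a state-space layer
        return x + sum(branch(x) for branch in self.branches)
```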
- Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong-performing video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
- Dual Aggregation Transformer for Image Super-Resolution [92.41781921611646]
We propose a novel Transformer model, the Dual Aggregation Transformer (DAT), for image super-resolution (SR).
DAT aggregates features across the spatial and channel dimensions in an inter-block and intra-block dual manner (a rough sketch of such dual aggregation follows this entry).
Experiments show that DAT surpasses current methods.
arXiv Detail & Related papers (2023-08-07T07:39:39Z)
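As a loose illustration of aggregating features across spatial and channel dimensions by alternating token attention with channel attention (the names and block layout below are assumptions, not DAT's actual architecture):

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Self-attention along the channel axis: a C x C attention map mixes
    channels, using the N spatial positions as the feature dimension."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)              # each (B, N, C)
        attn = (q.transpose(1, 2) @ k) / q.shape[1] ** 0.5  # (B, C, C)
        out = (attn.softmax(-1) @ v.transpose(1, 2)).transpose(1, 2)
        return self.proj(out)                               # (B, N, C)


class DualAggregationSketch(nn.Module):
    """Alternate spatial (token) attention and channel attention blocks,
    a rough stand-in for inter-block dual aggregation."""

    def __init__(self, dim: int, depth: int = 4, heads: int = 4):
        super().__init__()
        self.spatial = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(depth // 2)]
        )
        self.channel = nn.ModuleList(
            [ChannelAttention(dim) for _ in range(depth // 2)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) flattened patch features
        for sp, ch in zip(self.spatial, self.channel):
            x = x + sp(x, x, x, need_weights=False)[0]  # spatial aggregation
            x = x + ch(x)                               # channel aggregation
        return x
```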
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- Ripple Attention for Visual Perception with Sub-quadratic Complexity [7.425337104538644]
Transformer architectures are now central to modeling in natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space (a toy distance-weighting sketch follows this entry).
arXiv Detail & Related papers (2021-10-06T02:00:38Z)
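As a toy illustration of distance-based weighting (the real ripple attention partitions tokens into distance rings to reach sub-quadratic cost, whereas the dense N x N weighting below is quadratic and only for exposition):

```python
import torch


def ring_weights(h: int, w: int, decay: float = 0.5) -> torch.Tensor:
    """Weight each key token by decay**k, where k is its Chebyshev-distance
    'ring' from the query token on the h x w grid (illustrative choice)."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)        # (N, 2)
    rings = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)  # (N, N)
    return decay ** rings.float()


# Usage: reweight a plain attention map by spatial proximity, then renormalize.
h, w = 4, 4
attn = torch.rand(h * w, h * w).softmax(dim=-1)
weighted = attn * ring_weights(h, w)
weighted = weighted / weighted.sum(dim=-1, keepdim=True)
```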
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.