Interpretable Vision Transformers in Monocular Depth Estimation via SVDA
- URL: http://arxiv.org/abs/2602.11005v1
- Date: Wed, 11 Feb 2026 16:27:15 GMT
- Title: Interpretable Vision Transformers in Monocular Depth Estimation via SVDA
- Authors: Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos
- Abstract summary: We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT). SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
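The core operation described above admits a compact sketch. Below is a minimal single-head PyTorch reading of that description, assuming row-wise L2 normalization of queries and keys for the directional term and a learnable diagonal `sigma` for the spectral term; the projection layout, scaling, and multi-head handling are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDAttention(nn.Module):
    """Minimal single-head sketch of SVD-Inspired Attention (SVDA):
    directional alignment from L2-normalized query-key rows, spectral
    modulation from a learnable diagonal matrix inserted between them."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.sigma = nn.Parameter(torch.ones(dim))  # diagonal of Sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q = F.normalize(self.q_proj(x), dim=-1)  # unit-norm directions
        k = F.normalize(self.k_proj(x), dim=-1)
        v = self.v_proj(x)
        # scores[b, i, j] = sum_d q[b, i, d] * sigma[d] * k[b, j, d]
        scores = torch.einsum("bid,d,bjd->bij", q, self.sigma, k)
        return scores.softmax(dim=-1) @ v
```

With the alignment term unit-normed and `sigma` explicit, descriptors such as attention entropy, effective rank, or sparsity can be computed directly from `scores` and `sigma`, which is the kind of quantity the six spectral indicators measure.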
Related papers
- STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning [65.36458157092207]
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. We introduce STVG-R1, the first reinforcement learning framework for spatio-temporal video grounding (STVG), which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
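The entry names a reward that jointly scores temporal, spatial, and format quality but does not spell it out. A hedged sketch of one plausible shape, a weighted sum of temporal IoU, mean per-frame box IoU, and a format bonus, follows; the weights, field names, and term definitions are illustrative assumptions, not STVG-R1's actual reward.

```python
def stvg_reward(pred: dict, gt: dict, w_time: float = 1.0,
                w_space: float = 1.0, w_fmt: float = 0.2) -> float:
    """Weighted sum of temporal IoU, mean per-frame box IoU, and a
    format bonus. All names and terms here are assumptions."""
    # Temporal IoU between predicted and ground-truth segments.
    inter = max(0.0, min(pred["end"], gt["end"]) - max(pred["start"], gt["start"]))
    union = max(pred["end"], gt["end"]) - min(pred["start"], gt["start"])
    t_iou = inter / union if union > 0 else 0.0

    def box_iou(a, b):  # boxes as (x1, y1, x2, y2)
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        ov = iw * ih
        un = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - ov
        return ov / un if un > 0 else 0.0

    # Spatial consistency: mean IoU over frames annotated on both sides.
    frames = pred["boxes"].keys() & gt["boxes"].keys()
    s_iou = (sum(box_iou(pred["boxes"][f], gt["boxes"][f]) for f in frames)
             / len(frames)) if frames else 0.0
    fmt = 1.0 if pred.get("well_formed", False) else 0.0  # format check
    return w_time * t_iou + w_space * s_iou + w_fmt * fmt
```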
arXiv Detail & Related papers (2026-02-12T08:53:32Z)
- Interpretable Vision Transformers in Image Classification via SVDA [5.8833115420537085]
Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. We adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure.
arXiv Detail & Related papers (2026-02-11T16:20:32Z) - Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving [48.512353531499286]
We introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that implicitly integrates 2D/3D scene understanding abilities within a single vision-language model (VLM). We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on 2D detection and nuScenes BEV 3D detection.
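The entry mentions IoU-aware scoring without details. In the detection literature this usually means blending classification confidence with a predicted localization IoU, as in the hedged one-liner below; the geometric-mean form and `alpha` are assumptions about how Percept-WAM might implement it, not its documented design.

```python
def iou_aware_score(cls_prob: float, pred_iou: float, alpha: float = 0.5) -> float:
    """Blend classification confidence with a predicted localization IoU
    so that confident but poorly localized boxes are down-weighted."""
    return (cls_prob ** alpha) * (pred_iou ** (1.0 - alpha))
```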
arXiv Detail & Related papers (2025-11-24T15:28:25Z) - Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception [8.543667347406286]
We reconceptualize deep neural encoders as hierarchical communication chains that compress raw sensory inputs into task-relevant latent features. We propose Eloss, a novel entropy-based regularizer designed as a lightweight, plug-and-play training objective.
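As a rough illustration of what an entropy-based, plug-and-play regularizer over a hierarchy of latent features can look like, here is a minimal sketch; the entropy proxy, the layer-to-layer stability penalty, and all names are assumptions rather than the published Eloss formulation.

```python
import torch

def eloss(features: list[torch.Tensor], eps: float = 1e-8) -> torch.Tensor:
    """Penalize abrupt entropy changes between consecutive encoder stages.
    Entropy proxy and penalty shape are assumptions, not the paper's Eloss."""
    entropies = []
    for f in features:  # each f: (batch, ...) activations from one stage
        p = torch.softmax(f.flatten(1), dim=1)  # cast activations as a distribution
        entropies.append(-(p * (p + eps).log()).sum(dim=1).mean())
    return sum((entropies[i + 1] - entropies[i]) ** 2
               for i in range(len(entropies) - 1))
```

The intended plug-and-play usage would be something like `total_loss = task_loss + lam * eloss(stage_features)`, with `lam` a small weight.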
arXiv Detail & Related papers (2025-09-18T17:01:27Z) - SVDformer: Direction-Aware Spectral Graph Embedding Learning via SVD and Transformer [24.552037222044504]
SVDformer is a novel framework that synergizes SVD and Transformer architecture for direction-aware graph representation learning. Experiments on six directed graph benchmarks demonstrate that SVDformer consistently outperforms state-of-the-art GNNs and direction-aware baselines on node classification tasks.
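The title suggests that direction awareness comes from the asymmetry an SVD exposes in a directed adjacency matrix. A minimal sketch of that idea follows, with separate source-role and target-role embeddings from a truncated SVD; how SVDformer actually combines these with its Transformer is an assumption here.

```python
import torch

def svd_direction_features(adj: torch.Tensor, k: int) -> torch.Tensor:
    """Direction-aware node features from a directed adjacency matrix:
    a truncated SVD adj ~ U S V^T yields out-going (U) and in-coming (V)
    role embeddings, concatenated per node. Illustrative, not SVDformer's
    exact pipeline."""
    U, S, Vh = torch.linalg.svd(adj)
    u_k = U[:, :k] * S[:k].sqrt()         # source-role embeddings
    v_k = Vh[:k, :].T * S[:k].sqrt()      # target-role embeddings
    return torch.cat([u_k, v_k], dim=-1)  # (nodes, 2k) features
```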
arXiv Detail & Related papers (2025-08-19T01:32:18Z) - Transformer Meets Twicing: Harnessing Unattended Residual Information [2.1605931466490795]
Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers. We propose Twicing Attention, a novel attention mechanism that uses the kernel twicing procedure from nonparametric regression to alleviate the low-pass behavior of the associated nonlocal-means (NLM) smoothing.
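Twicing itself is a classical trick from nonparametric regression: run the smoother once, then run it again on the residuals and add the result back, which yields the operator 2A - A^2. A minimal sketch applied to an attention matrix follows; how the paper integrates this into full Transformer blocks is not shown here.

```python
import torch

def twicing_attention(attn: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Tukey's twicing applied to attention smoothing: given a
    row-stochastic attention matrix A and values V, return
    (2A - A @ A) V, recovering signal a plain smoother attenuates."""
    smoothed = attn @ v                # first pass: standard attention output
    residual = v - smoothed            # what the smoother missed
    return smoothed + attn @ residual  # = (2A - A @ A) @ v
```

Since a row-stochastic A acts as a low-pass filter, the twiced operator maps each eigenvalue lambda to 2*lambda - lambda^2 = 1 - (1 - lambda)^2, which attenuates mid-spectrum components less than plain attention does.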
arXiv Detail & Related papers (2025-03-02T01:56:35Z) - Optical aberrations in autonomous driving: Physics-informed parameterized temperature scaling for neural network uncertainty calibration [49.03824084306578]
We propose to incorporate a physical inductive bias into the neural network calibration architecture to enhance the robustness and the trustworthiness of the AI target application. We pave the way for a trustworthy uncertainty representation and for a holistic verification strategy of the perception chain.
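Temperature scaling divides logits by a scalar before the softmax; the parameterized variant in this title replaces the global scalar with a per-sample temperature predicted from side information. A hedged sketch follows, where the physics-informed inputs (e.g., estimated optical-aberration parameters) and the small network are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ParameterizedTemperature(nn.Module):
    """Per-sample temperature scaling: a small network maps physics-informed
    inputs to a positive temperature used to rescale logits before softmax.
    Input choice and network shape are assumptions."""

    def __init__(self, n_phys: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_phys, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, logits: torch.Tensor, phys: torch.Tensor) -> torch.Tensor:
        # phys: (batch, n_phys) physical parameters; logits: (batch, classes)
        t = nn.functional.softplus(self.net(phys)) + 1e-3  # keep temperature > 0
        return torch.softmax(logits / t, dim=-1)
```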
arXiv Detail & Related papers (2024-12-18T10:36:46Z) - DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
This insight, which can be adapted to various attention-related models, suggests that the current Transformer architecture has room for further evolution.
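A minimal reading of "attention score as feature map": treat the pre-softmax score tensor of shape (batch, heads, query, key) as a multi-channel image and convolve it before normalizing. The depthwise kernel, the residual form, and the kernel size below are assumptions, not DAPE V2's exact design.

```python
import torch
import torch.nn as nn

class ConvProcessedAttention(nn.Module):
    """Process pre-softmax attention scores with a depthwise 2D convolution
    over the (query, key) axes, mixing local neighborhoods of the score map
    before normalization. Kernel choices are illustrative assumptions."""

    def __init__(self, heads: int, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(heads, heads, kernel,
                              padding=kernel // 2, groups=heads)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, heads, tokens, tokens), pre-softmax
        return torch.softmax(scores + self.conv(scores), dim=-1)
```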
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Self-supervised Multi-future Occupancy Forecasting for Autonomous Driving [31.995016095663544]
LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye-view scene representation. Our proposed framework, Latent Occupancy Prediction (LOPR), performs L-OGM prediction in the latent space of a generative architecture.
arXiv Detail & Related papers (2024-07-30T18:37:59Z) - 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
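For reference, the naive 2-D state-space recurrence (a Roesser-style model) that such a layer builds on looks like the sketch below: the state at each pixel combines the states of its top and left neighbors with the current input. The paper's contribution is an efficient parameterization and accelerated computation of this kind of scan, which the explicit double loop here does not attempt.

```python
import torch

def ssm2d(x: torch.Tensor, A1: torch.Tensor, A2: torch.Tensor,
          B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Naive 2-D state-space recurrence over an image grid:
    h[i, j] = A1 h[i-1, j] + A2 h[i, j-1] + B x[i, j], y = C h.
    A sketch of the recurrence only, not the paper's fast implementation."""
    H, W, _ = x.shape            # x: (height, width, in_dim)
    n = A1.shape[0]
    h = torch.zeros(H, W, n)
    for i in range(H):
        for j in range(W):
            top = h[i - 1, j] if i > 0 else torch.zeros(n)
            left = h[i, j - 1] if j > 0 else torch.zeros(n)
            h[i, j] = A1 @ top + A2 @ left + B @ x[i, j]
    return h @ C.T               # (height, width, out_dim)
```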
arXiv Detail & Related papers (2023-06-11T09:41:37Z) - Forecasting of depth and ego-motion with transformers and
self-supervision [0.0]
This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego motion.
Given a sequence of raw images, the aim is to forecast both the geometry and the ego-motion using a self-supervised photometric loss.
The architecture is designed using both convolution and transformer modules.
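The photometric loss referenced here is standard in self-supervised depth pipelines: compare the target frame against a source frame warped into the target view using the predicted depth and pose. A hedged sketch of the usual SSIM-plus-L1 form follows, with global rather than windowed SSIM statistics for brevity and the warping step omitted; the paper's exact loss may differ.

```python
import torch

def photometric_loss(target: torch.Tensor, warped: torch.Tensor,
                     alpha: float = 0.85) -> torch.Tensor:
    """Alpha-weighted mix of (1 - SSIM)/2 and L1 between the target frame
    and the warped source frame. Global SSIM statistics are a
    simplification of the usual windowed computation."""
    l1 = (target - warped).abs().mean()
    mu_t, mu_w = target.mean(), warped.mean()
    var_t, var_w = target.var(), warped.var()
    cov = ((target - mu_t) * (warped - mu_w)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_t * mu_w + c1) * (2 * cov + c2)) / \
           ((mu_t ** 2 + mu_w ** 2 + c1) * (var_t + var_w + c2))
    return alpha * (1 - ssim) / 2 + (1 - alpha) * l1
```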
arXiv Detail & Related papers (2022-06-15T10:14:11Z) - DepthFormer: Exploiting Long-Range Correlation and Local Information for
Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)