Interpretable Vision Transformers in Image Classification via SVDA
- URL: http://arxiv.org/abs/2602.10994v1
- Date: Wed, 11 Feb 2026 16:20:32 GMT
- Title: Interpretable Vision Transformers in Image Classification via SVDA
- Authors: Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos
- Abstract summary: Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. We adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure.
- Score: 5.8833115420537085
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply interpretability indicators -- originally proposed with SVDA -- to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
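The abstract does not specify which interpretability indicators are used. As a purely illustrative stand-in, the sketch below computes two generic attention diagnostics (row-wise entropy and a near-zero sparsity ratio) that could serve a similar monitoring role; the function names and threshold are assumptions, not the paper's definitions.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean row-wise entropy of a softmax attention map.

    attn: (batch, heads, queries, keys). Lower entropy indicates more peaked,
    potentially more interpretable attention.
    """
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per query row
    return ent.mean()

def attention_sparsity(attn: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """Fraction of attention weights below a small threshold tau."""
    return (attn < tau).float().mean()

# Toy usage on a random attention map (2 images, 4 heads, 16 tokens).
attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)
print(f"entropy:  {attention_entropy(attn):.3f}")
print(f"sparsity: {attention_sparsity(attn):.3f}")
```

Logged per layer and per epoch, scalars like these give the kind of training-time view of attention dynamics the abstract describes.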
Related papers
- Diagnosing Generalization Failures from Representational Geometry Markers [8.403001493770427]
We study generalization failures, drawing inspiration from medical biomarkers. We design and test network markers to probe structure-function links, identify prognostic indicators, and validate predictions in real-world settings. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection and AI interpretability.
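The abstract does not name a concrete marker. One hypothetical example of a representational-geometry probe is the effective rank of a layer's activations; the snippet below illustrates only that general idea, not the markers studied in the paper.

```python
import torch

def effective_rank(feats: torch.Tensor, eps: float = 1e-9) -> float:
    """Effective rank: exp of the entropy of the normalized singular values.

    feats: (num_samples, feature_dim) activations from one layer.
    A low effective rank can signal collapsed, less generalizable representations.
    """
    feats = feats - feats.mean(dim=0, keepdim=True)  # center the features
    s = torch.linalg.svdvals(feats)                  # singular values
    p = s / (s.sum() + eps)                          # treat the spectrum as a distribution
    entropy = -(p * (p + eps).log()).sum()
    return float(entropy.exp())

# Toy usage: 512 samples of 64-dimensional features.
print(effective_rank(torch.randn(512, 64)))
```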
arXiv Detail & Related papers (2026-03-02T13:59:19Z)
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
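The abstract names the ingredients (a high-pass filter for texture, PCA for global structure, nearest-neighbor propagation) but not the exact scoring rule. The sketch below is a loose guess at how a per-token importance score might combine the first two; the projection rank `k` and the 0.5 weighting are arbitrary assumptions, not the authors' criterion.

```python
import torch

def token_importance(tokens: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Toy importance score per token.

    tokens: (num_tokens, dim) token features at one scale.
    Structure term: energy of each token's projection onto the top-k principal
    components. Texture term: energy of the centered token itself, a crude
    stand-in for a high-pass filter response.
    """
    centered = tokens - tokens.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)  # PCA directions
    structure = (centered @ vh[:k].T).pow(2).sum(dim=-1)        # global-structure energy
    texture = centered.pow(2).sum(dim=-1)                       # "high-pass" energy
    return structure + 0.5 * texture

# Keep the 64 highest-scoring tokens out of 256.
scores = token_importance(torch.randn(256, 192))
keep = scores.topk(64).indices
```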
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
- Interpretable Vision Transformers in Monocular Depth Estimation via SVDA [5.8833115420537085]
We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT). SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead.
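Based only on the one-line description above, a minimal sketch of such an attention score could L2-normalize queries and keys and insert a learnable diagonal rescaling between them. The module below is a guess at that structure, not the official SVDA implementation, and the formulation in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDAStyleAttention(nn.Module):
    """Illustrative single-head attention with a learnable diagonal "spectrum"."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.sigma = nn.Parameter(torch.ones(dim))  # learnable diagonal modulation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = F.normalize(self.q(x), dim=-1)  # unit-norm queries: directional alignment only
        k = F.normalize(self.k(x), dim=-1)  # unit-norm keys
        # scores = Q diag(sigma) K^T, separating direction from per-dimension scale
        scores = torch.einsum("bnd,d,bmd->bnm", q, self.sigma, k)
        attn = scores.softmax(dim=-1)
        return attn @ self.v(x)

# Toy usage: 2 sequences of 16 tokens with 64-dimensional embeddings.
out = SVDAStyleAttention(64)(torch.randn(2, 16, 64))
```

Keeping the diagonal term as an explicit parameter is what makes spectral diagnostics convenient: its entries can be inspected directly, much like singular values.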
arXiv Detail & Related papers (2026-02-11T16:27:15Z)
- Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z)
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z)
- ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers [29.932706137805713]
ASCENT-ViT is an attention-based concept learning framework for Vision Transformers (ViTs). It composes scale- and position-aware representations from multiscale feature pyramids and ViT patch representations, respectively. It can be utilized as a classification head on top of standard ViT backbones for improved predictive performance and accurate and robust concept explanations.
arXiv Detail & Related papers (2025-01-16T00:45:05Z)
- ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections [8.372189962601077]
The Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers.
We propose a novel residual attention learning method for improving ViT-based architectures.
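The abstract does not give the exact residual formulation. A bare-bones sketch of the general idea, carrying the previous layer's attention map forward through a convex combination (the mixing weight `alpha` is an assumption), might look like:

```python
from typing import Optional

import torch

def residual_attention(scores: torch.Tensor,
                       prev_attn: Optional[torch.Tensor],
                       alpha: float = 0.5) -> torch.Tensor:
    """Blend the current layer's attention with the previous layer's map.

    scores:    (batch, heads, n, n) raw attention logits of the current layer.
    prev_attn: attention weights from the previous layer, or None for the first layer.
    The residual path is meant to counteract feature collapse in deep ViTs.
    """
    attn = scores.softmax(dim=-1)
    if prev_attn is not None:
        attn = (1.0 - alpha) * attn + alpha * prev_attn  # convex mix stays a distribution
    return attn

# Toy usage across two consecutive layers.
s1, s2 = torch.randn(1, 4, 16, 16), torch.randn(1, 4, 16, 16)
a1 = residual_attention(s1, None)
a2 = residual_attention(s2, a1)
```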
arXiv Detail & Related papers (2024-02-17T14:44:10Z)
- Representation Learning in a Decomposed Encoder Design for Bio-inspired Hebbian Learning [5.67478985222587]
We propose a modular framework trained with a bio-inspired variant of contrastive predictive coding, comprising parallel encoders that leverage different invariant visual descriptors as inductive biases. Our findings indicate that this form of inductive bias significantly improves the robustness of learned representations and narrows the performance gap between models.
arXiv Detail & Related papers (2023-11-22T07:58:14Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
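As a toy illustration of the semantic-negative idea (the construction in Structure-CLIP itself is richer), the hypothetical sketch below swaps the subject and object of a scene-graph triple so the negative caption reuses the same words but breaks the structure:

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    relation: str
    object: str

def caption(t: Triple) -> str:
    return f"a {t.subject} {t.relation} a {t.object}"

def hard_negative(t: Triple) -> str:
    """Swap subject and object: same bag of words, different structure."""
    return caption(Triple(t.object, t.relation, t.subject))

t = Triple("dog", "chasing", "cat")
print(caption(t))        # a dog chasing a cat   (positive)
print(hard_negative(t))  # a cat chasing a dog   (structured negative)
```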
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- Uncovering the Inner Workings of STEGO for Safe Unsupervised Semantic Segmentation [68.8204255655161]
Self-supervised pre-training strategies have recently shown impressive results for training general-purpose feature extraction backbones in computer vision.
The DINO self-distillation technique has interesting emerging properties, such as unsupervised clustering in the latent space and semantic correspondences of the produced features without using explicit human-annotated labels.
The STEGO method for unsupervised semantic segmentation contrastively distills feature correspondences of a DINO-pre-trained Vision Transformer and recently set a new state of the art.
arXiv Detail & Related papers (2023-04-14T15:30:26Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces superior visual grounding results compared to previous methods.
AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model.
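The abstract does not state the loss itself. One rough, hypothetical form of a mask-consistency objective, pushing a gradient-based saliency map to concentrate inside an annotated region via a margin, is sketched below; the published AMC formulation may differ.

```python
import torch

def mask_consistency_loss(saliency: torch.Tensor,
                          mask: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Hinge loss: mean saliency inside the annotated region should exceed
    mean saliency outside it by at least `margin`.

    saliency: (batch, H, W) non-negative explanation map (e.g. Grad-CAM style).
    mask:     (batch, H, W) binary region annotation.
    """
    eps = 1e-6
    inside = (saliency * mask).sum(dim=(1, 2)) / (mask.sum(dim=(1, 2)) + eps)
    outside = (saliency * (1 - mask)).sum(dim=(1, 2)) / ((1 - mask).sum(dim=(1, 2)) + eps)
    return torch.relu(margin + outside - inside).mean()

# Toy usage with a random saliency map and mask.
sal = torch.rand(2, 14, 14)
msk = (torch.rand(2, 14, 14) > 0.5).float()
print(mask_consistency_loss(sal, msk))
```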
arXiv Detail & Related papers (2022-06-30T17:55:12Z)
- Understanding The Robustness in Vision Transformers [140.1090560977082]
Self-attention may promote robustness through improved mid-level representations.
We propose a family of fully attentional networks (FANs) that strengthen this capability.
Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters.
arXiv Detail & Related papers (2022-04-26T17:16:32Z)