Interpretable Vision Transformers in Image Classification via SVDA
- URL: http://arxiv.org/abs/2602.10994v1
- Date: Wed, 11 Feb 2026 16:20:32 GMT
- Title: Interpretable Vision Transformers in Image Classification via SVDA
- Authors: Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos
- Abstract summary: Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. We adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure.
- Score: 5.8833115420537085
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply interpretability indicators -- originally proposed with SVDA -- to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
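The abstract does not specify which interpretability indicators are used. As a purely illustrative stand-in, the sketch below computes two generic attention diagnostics (row-wise entropy and a near-zero sparsity ratio) that could serve a similar monitoring role; the function names and threshold are assumptions, not the paper's definitions.

```python
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Mean row-wise entropy of a softmax attention map.

    attn: (batch, heads, queries, keys). Lower entropy indicates more peaked,
    potentially more interpretable attention.
    """
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per query row
    return ent.mean()

def attention_sparsity(attn: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """Fraction of attention weights below a small threshold tau."""
    return (attn < tau).float().mean()

# Toy usage on a random attention map (2 images, 4 heads, 16 tokens).
attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)
print(f"entropy:  {attention_entropy(attn):.3f}")
print(f"sparsity: {attention_sparsity(attn):.3f}")
```

Logged per layer and per epoch, scalars like these give the kind of training-time view of attention dynamics the abstract describes.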
Related papers
- Diagnosing Generalization Failures from Representational Geometry Markers [8.403001493770427]
We study generalization failures, drawing inspiration from medical biomarkers. We design and test network markers to probe structure-function links, identify prognostic indicators, and validate predictions in real-world settings. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection and AI interpretability.
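The abstract does not name a concrete marker. One hypothetical example of a representational-geometry probe is the effective rank of a layer's activations; the snippet below illustrates only that general idea, not the markers studied in the paper.

```python
import torch

def effective_rank(feats: torch.Tensor, eps: float = 1e-9) -> float:
    """Effective rank: exp of the entropy of the normalized singular values.

    feats: (num_samples, feature_dim) activations from one layer.
    A low effective rank can signal collapsed, less generalizable representations.
    """
    feats = feats - feats.mean(dim=0, keepdim=True)  # center the features
    s = torch.linalg.svdvals(feats)                  # singular values
    p = s / (s.sum() + eps)                          # treat the spectrum as a distribution
    entropy = -(p * (p + eps).log()).sum()
    return float(entropy.exp())

# Toy usage: 512 samples of 64-dimensional features.
print(effective_rank(torch.randn(512, 64)))
```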
arXiv Detail & Related papers (2026-03-02T13:59:19Z)
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
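The abstract names the ingredients (a high-pass filter for texture, PCA for global structure, nearest-neighbor propagation) but not the exact scoring rule. The sketch below is a loose guess at how a per-token importance score might combine the first two; the projection rank `k` and the 0.5 weighting are arbitrary assumptions, not the authors' criterion.

```python
import torch

def token_importance(tokens: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Toy importance score per token.

    tokens: (num_tokens, dim) token features at one scale.
    Structure term: energy of each token's projection onto the top-k principal
    components. Texture term: energy of the centered token itself, a crude
    stand-in for a high-pass filter response.
    """
    centered = tokens - tokens.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)  # PCA directions
    structure = (centered @ vh[:k].T).pow(2).sum(dim=-1)        # global-structure energy
    texture = centered.pow(2).sum(dim=-1)                       # "high-pass" energy
    return structure + 0.5 * texture

# Keep the 64 highest-scoring tokens out of 256.
scores = token_importance(torch.randn(256, 192))
keep = scores.topk(64).indices
```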
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
- Interpretable Vision Transformers in Monocular Depth Estimation via SVDA [5.8833115420537085]
We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT). SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead.
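Based only on the one-line description above, a minimal sketch of such an attention score could L2-normalize queries and keys and insert a learnable diagonal rescaling between them. The module below is a guess at that structure, not the official SVDA implementation, and the formulation in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDAStyleAttention(nn.Module):
    """Illustrative single-head attention with a learnable diagonal "spectrum"."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.sigma = nn.Parameter(torch.ones(dim))  # learnable diagonal modulation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = F.normalize(self.q(x), dim=-1)  # unit-norm queries: directional alignment only
        k = F.normalize(self.k(x), dim=-1)  # unit-norm keys
        # scores = Q diag(sigma) K^T, separating direction from per-dimension scale
        scores = torch.einsum("bnd,d,bmd->bnm", q, self.sigma, k)
        attn = scores.softmax(dim=-1)
        return attn @ self.v(x)

# Toy usage: 2 sequences of 16 tokens with 64-dimensional embeddings.
out = SVDAStyleAttention(64)(torch.randn(2, 16, 64))
```

Keeping the diagonal term as an explicit parameter is what makes spectral diagnostics convenient: its entries can be inspected directly, much like singular values.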
arXiv Detail & Related papers (2026-02-11T16:27:15Z)
- Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z)
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z)
- ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers [29.932706137805713]
ASCENT-ViT is an attention-based concept learning framework for Vision Transformers (ViTs). It composes scale- and position-aware representations from multiscale feature pyramids and ViT patch representations, respectively. It can be utilized as a classification head on top of standard ViT backbones for improved predictive performance and accurate and robust concept explanations.
arXiv Detail & Related papers (2025-01-16T00:45:05Z)
- ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections [8.372189962601077]
The Vision Transformer (ViT) self-attention mechanism is characterized by feature collapse in deeper layers.
We propose a novel residual attention learning method for improving ViT-based architectures.
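The abstract does not give the exact residual formulation. A bare-bones sketch of the general idea, carrying the previous layer's attention map forward through a convex combination (the mixing weight `alpha` is an assumption), might look like:

```python
from typing import Optional

import torch

def residual_attention(scores: torch.Tensor,
                       prev_attn: Optional[torch.Tensor],
                       alpha: float = 0.5) -> torch.Tensor:
    """Blend the current layer's attention with the previous layer's map.

    scores:    (batch, heads, n, n) raw attention logits of the current layer.
    prev_attn: attention weights from the previous layer, or None for the first layer.
    The residual path is meant to counteract feature collapse in deep ViTs.
    """
    attn = scores.softmax(dim=-1)
    if prev_attn is not None:
        attn = (1.0 - alpha) * attn + alpha * prev_attn  # convex mix stays a distribution
    return attn

# Toy usage across two consecutive layers.
s1, s2 = torch.randn(1, 4, 16, 16), torch.randn(1, 4, 16, 16)
a1 = residual_attention(s1, None)
a2 = residual_attention(s2, a1)
```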
arXiv Detail & Related papers (2024-02-17T14:44:10Z)
- Representation Learning in a Decomposed Encoder Design for Bio-inspired Hebbian Learning [5.67478985222587]
We propose a modular framework trained with a bio-inspired variant of contrastive predictive coding, comprising parallel encoders that leverage different invariant visual descriptors as inductive biases. Our findings indicate that this form of inductive bias significantly improves the robustness of learned representations and narrows the performance gap between models.
arXiv Detail & Related papers (2023-11-22T07:58:14Z)
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations [70.41385310930846]
We present an end-to-end framework Structure-CLIP to enhance multi-modal structured representations.
We use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations.
A Knowledge-Enhanced Encoder (KEE) is proposed to leverage scene graph knowledge (SGK) as input to further enhance structured representations.
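As a toy illustration of the semantic-negative idea (the construction in Structure-CLIP itself is richer), the hypothetical sketch below swaps the subject and object of a scene-graph triple so the negative caption reuses the same words but breaks the structure:

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    relation: str
    object: str

def caption(t: Triple) -> str:
    return f"a {t.subject} {t.relation} a {t.object}"

def hard_negative(t: Triple) -> str:
    """Swap subject and object: same bag of words, different structure."""
    return caption(Triple(t.object, t.relation, t.subject))

t = Triple("dog", "chasing", "cat")
print(caption(t))        # a dog chasing a cat   (positive)
print(hard_negative(t))  # a cat chasing a dog   (structured negative)
```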
arXiv Detail & Related papers (2023-05-06T03:57:05Z)
- Uncovering the Inner Workings of STEGO for Safe Unsupervised Semantic Segmentation [68.8204255655161]
Self-supervised pre-training strategies have recently shown impressive results for training general-purpose feature extraction backbones in computer vision.
The DINO self-distillation technique has interesting emerging properties, such as unsupervised clustering in the latent space and semantic correspondences of the produced features without using explicit human-annotated labels.
The STEGO method for unsupervised semantic segmentation contrastively distills feature correspondences of a DINO-pre-trained Vision Transformer and recently set a new state of the art.
arXiv Detail & Related papers (2023-04-14T15:30:26Z)
- Top-Down Visual Attention from Analysis by Synthesis [87.47527557366593]
We consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision.
We propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention.
arXiv Detail & Related papers (2023-03-23T05:17:05Z)
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [58.442103936918805]
We show that Attention Mask Consistency (AMC) produces superior visual grounding results compared to previous methods.
AMC is effective, easy to implement, and general, as it can be adopted by any vision-language model.
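The abstract does not state the loss itself. One rough, hypothetical form of a mask-consistency objective, pushing a gradient-based saliency map to concentrate inside an annotated region via a margin, is sketched below; the published AMC formulation may differ.

```python
import torch

def mask_consistency_loss(saliency: torch.Tensor,
                          mask: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Hinge loss: mean saliency inside the annotated region should exceed
    mean saliency outside it by at least `margin`.

    saliency: (batch, H, W) non-negative explanation map (e.g. Grad-CAM style).
    mask:     (batch, H, W) binary region annotation.
    """
    eps = 1e-6
    inside = (saliency * mask).sum(dim=(1, 2)) / (mask.sum(dim=(1, 2)) + eps)
    outside = (saliency * (1 - mask)).sum(dim=(1, 2)) / ((1 - mask).sum(dim=(1, 2)) + eps)
    return torch.relu(margin + outside - inside).mean()

# Toy usage with a random saliency map and mask.
sal = torch.rand(2, 14, 14)
msk = (torch.rand(2, 14, 14) > 0.5).float()
print(mask_consistency_loss(sal, msk))
```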
arXiv Detail & Related papers (2022-06-30T17:55:12Z)
- Understanding The Robustness in Vision Transformers [140.1090560977082]
Self-attention may promote robustness through improved mid-level representations.
We propose a family of fully attentional networks (FANs) that strengthen this capability.
Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters.
arXiv Detail & Related papers (2022-04-26T17:16:32Z)