Learning to Look: Cognitive Attention Alignment with Vision-Language Models
- URL: http://arxiv.org/abs/2509.21247v1
- Date: Thu, 25 Sep 2025 14:40:48 GMT
- Title: Learning to Look: Cognitive Attention Alignment with Vision-Language Models
- Authors: Ryan L. Yang, Dipkamal Bhusal, Nidhi Rastogi
- Abstract summary: Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations. Recent methods have sought to guide model attention using concept-based supervision and explanation regularization. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps.
- Score: 2.676349883103404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional Neural Networks (CNNs) frequently "cheat" by exploiting superficial correlations, raising concerns about whether they make predictions for the right reasons. Inspired by cognitive science, which highlights the role of attention in robust human perception, recent methods have sought to guide model attention using concept-based supervision and explanation regularization. However, these techniques depend on labor-intensive, expert-provided annotations, limiting their scalability. We propose a scalable framework that leverages vision-language models to automatically generate semantic attention maps using natural language prompts. By introducing an auxiliary loss that aligns CNN attention with these language-guided maps, our approach promotes more reliable and cognitively plausible decision-making without manual annotation. Experiments on challenging datasets, ColoredMNIST and DecoyMNIST, show that our method achieves state-of-the-art performance on ColoredMNIST and remains competitive with annotation-heavy baselines on DecoyMNIST, demonstrating improved generalization, reduced shortcut reliance, and model attention that better reflects human intuition.
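The following is a minimal PyTorch sketch of the auxiliary-loss idea in the abstract: a CNN's spatial attention is normalized into a distribution and pulled toward a language-guided map via KL divergence. The tiny network, the activation-based attention, the loss weight, and the `target_map` tensor (a stand-in for the VLM-generated semantic attention map) are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedCNN(nn.Module):
    """Toy CNN that exposes an activation-based spatial attention map."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        f = self.features(x)                    # (B, C, H, W)
        attn = f.mean(dim=1)                    # (B, H, W): channel-mean attention
        logits = self.head(f.mean(dim=(2, 3)))  # global-average-pooled classifier
        return logits, attn

def alignment_loss(attn, target_map, eps=1e-8):
    # Normalize both maps into spatial distributions and penalize
    # their divergence, KL(target || attn).
    p = attn.flatten(1).clamp_min(eps)
    p = p / p.sum(dim=1, keepdim=True)
    q = target_map.flatten(1).clamp_min(eps)
    q = q / q.sum(dim=1, keepdim=True)
    return F.kl_div(p.log(), q, reduction="batchmean")

model = AlignedCNN()
x, y = torch.randn(4, 3, 28, 28), torch.randint(0, 10, (4,))
target_map = torch.rand(4, 28, 28)  # stand-in for a VLM semantic attention map
logits, attn = model(x)
loss = F.cross_entropy(logits, y) + 0.5 * alignment_loss(attn, target_map)
loss.backward()
```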
Related papers
- A Resource-Rational Principle for Modeling Visual Attention Control [13.330522631439917]
This dissertation develops a resource-rational, simulation-based framework for modeling visual attention. I formalize visual tasks as bounded-optimal control problems using Partially Observable Markov Decision Processes. These models are instantiated in simulation environments spanning traditional text reading and reading-while-walking with smart glasses.
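For context, here is the one-step POMDP belief update that such bounded-optimal control formulations build on; this is the generic textbook form with made-up toy numbers, not the dissertation's models.

```python
# Generic POMDP belief update: b'(s') ~ O(o|s',a) * sum_s T(s'|s,a) * b(s)
import numpy as np

def belief_update(b, a, o, T, O):
    """b: (S,) prior belief; T: (A, S, S) with T[a][s, s'] = P(s'|s, a);
    O: (A, S, O) with O[a][s', o] = P(o|s', a). Returns posterior belief."""
    predicted = T[a].T @ b              # predictive distribution over s'
    posterior = O[a][:, o] * predicted  # weight by observation likelihood
    return posterior / posterior.sum()

# Toy example: 2 states, 1 action, 2 observations (numbers are illustrative).
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])
b = belief_update(np.array([0.5, 0.5]), a=0, o=1, T=T, O=O)
```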
arXiv Detail & Related papers (2026-03-02T16:45:50Z) - Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z) - Cognitively-Inspired Emergent Communication via Knowledge Graphs for Assisting the Visually Impaired [8.182196998385583]
We introduce a novel framework, Cognitively-Inspired Emergent Communication via Knowledge Graphs (VAG-EC), which emulates human visual perception and cognitive mapping. Our method constructs knowledge graphs to represent objects and their relationships, incorporating attention mechanisms to prioritize task-relevant entities, thereby mirroring human selective attention. This structured approach enables the emergence of compact, interpretable, and context-sensitive symbolic languages.
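As a toy illustration of the prioritization step, one can score knowledge-graph entities against a task embedding and softmax the scores; the entities, relations, embedding size, and dot-product scorer below are invented for the example and are not the VAG-EC pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# A miniature knowledge graph: entities plus (head, relation, tail) triples.
entities = ["cup", "table", "door"]
relations = [("cup", "on", "table"), ("door", "left_of", "table")]

emb = {e: rng.standard_normal(8) for e in entities}  # entity embeddings
task = rng.standard_normal(8)                        # task/query embedding

scores = np.array([emb[e] @ task for e in entities])
attn = np.exp(scores - scores.max())
attn /= attn.sum()                                   # softmax over entities

# Entities ranked by task relevance, mirroring selective attention.
ranked = sorted(zip(entities, attn), key=lambda t: -t[1])
```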
arXiv Detail & Related papers (2025-05-28T08:09:06Z) - Beyond RNNs: Benchmarking Attention-Based Image Captioning Models [0.0]
This study benchmarks the performance of attention-based image captioning models against RNN-based approaches. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions.
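For reference, this is the standard Bahdanau (additive) attention such captioners use, scoring image-region features against the decoder's hidden state; the dimensions and the 49-region feature grid are illustrative, and this is the textbook formulation rather than the benchmark's exact code.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: e_i = v^T tanh(W_f f_i + W_h h)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.W_feat = nn.Linear(feat_dim, attn_dim)
        self.W_hidden = nn.Linear(hidden_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, features, hidden):
        # features: (B, N, feat_dim) image regions; hidden: (B, hidden_dim)
        e = self.v(torch.tanh(self.W_feat(features)
                              + self.W_hidden(hidden).unsqueeze(1)))  # (B, N, 1)
        alpha = torch.softmax(e, dim=1)            # attention over regions
        context = (alpha * features).sum(dim=1)    # (B, feat_dim) context vector
        return context, alpha.squeeze(-1)

attn = BahdanauAttention(feat_dim=512, hidden_dim=256, attn_dim=128)
ctx, weights = attn(torch.randn(2, 49, 512), torch.randn(2, 256))
```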
arXiv Detail & Related papers (2025-02-26T01:05:18Z) - Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings. We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z) - Tuning-Free Accountable Intervention for LLM Deployment -- A Metacognitive Approach [55.613461060997004]
Large Language Models (LLMs) have catalyzed transformative advances across a spectrum of natural language processing tasks.
We propose an innovative metacognitive approach, dubbed CLEAR, to equip LLMs with capabilities for self-aware error identification and correction.
arXiv Detail & Related papers (2024-03-08T19:18:53Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
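A generic concept-bottleneck sketch makes the structure concrete: the label is predicted only from human-understandable concept activations, which is what makes the decision inspectable. The dimensions, sigmoid concept head, and linear label head are generic concept-bottleneck choices, not the paper's exact adaptation to PLMs.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    def __init__(self, in_dim, n_concepts, n_classes):
        super().__init__()
        self.to_concepts = nn.Linear(in_dim, n_concepts)  # predict concepts
        self.to_label = nn.Linear(n_concepts, n_classes)  # label from concepts only

    def forward(self, x):
        c = torch.sigmoid(self.to_concepts(x))  # concept activations in [0, 1]
        return self.to_label(c), c

model = ConceptBottleneck(in_dim=768, n_concepts=20, n_classes=2)
logits, concepts = model(torch.randn(4, 768))  # x: pooled text embeddings
```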
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - Semantic Interpretation and Validation of Graph Attention-based Explanations for GNN Models [9.260186030255081]
We propose a methodology for investigating the use of semantic attention to enhance the explainability of Graph Neural Network (GNN)-based models.
Our work extends existing attention-based graph explainability methods by analysing the divergence in the attention distributions in relation to semantically sorted feature sets.
We apply our methodology to a lidar point cloud estimation model, successfully identifying key semantic classes that contribute to enhanced performance.
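An illustrative version of that divergence analysis: aggregate attention mass per semantic class, then compare the resulting distribution to a reference with KL divergence. The class count, uniform reference, and random inputs below are placeholders, not the paper's pipeline.

```python
import numpy as np

def semantic_attention(attn, labels, n_classes):
    # Sum the attention weight falling on each semantic class, then normalize.
    mass = np.bincount(labels, weights=attn, minlength=n_classes)
    return mass / mass.sum()

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
attn = rng.random(100); attn /= attn.sum()   # attention over 100 points
labels = rng.integers(0, 4, size=100)        # semantic class per point
divergence = kl(semantic_attention(attn, labels, 4), np.full(4, 0.25))
```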
arXiv Detail & Related papers (2023-08-08T12:34:32Z) - Learnable Visual Words for Interpretable Image Recognition [70.85686267987744]
We propose the Learnable Visual Words (LVW) to interpret the model prediction behaviors with two novel modules.
The semantic visual words learning relaxes the category-specific constraint, enabling the general visual words shared across different categories.
Our experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW in both accuracy and model interpretation.
arXiv Detail & Related papers (2022-05-22T03:24:45Z) - Variational Structured Attention Networks for Deep Visual Representation Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z) - Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models [103.64435911083432]
We present a novel contrastive learning strategy called Proactive Pseudo-Intervention (PPI).
PPI leverages proactive interventions to guard against image features with no causal relevance.
We also devise a novel causally informed salience mapping module to identify key image pixels to intervene, and show it greatly facilitates model interpretability.
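A schematic of the intervention step: zero out the top-k most salient pixels so that training can penalize predictions that survive removal of their purported evidence. The masking fraction, tensor shapes, and hard zeroing are assumptions for illustration; this is not the PPI implementation.

```python
import torch

def intervene(x, saliency, frac=0.1):
    """Zero the top-`frac` most salient pixels per image.
    x: (B, C, H, W) images; saliency: (B, H, W) salience map."""
    flat = saliency.flatten(1)                      # (B, H*W)
    k = max(1, int(frac * flat.size(1)))
    idx = flat.topk(k, dim=1).indices               # most salient positions
    mask = torch.ones_like(flat)
    mask.scatter_(1, idx, 0.0)                      # knock out those pixels
    return x * mask.view_as(saliency).unsqueeze(1)  # broadcast over channels

x_masked = intervene(torch.randn(2, 3, 32, 32), torch.rand(2, 32, 32))
```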
arXiv Detail & Related papers (2020-12-06T20:30:26Z) - Deep Reinforced Attention Learning for Quality-Aware Visual Recognition [73.15276998621582]
We build upon the weakly-supervised generation mechanism of intermediate attention maps in any convolutional neural networks.
We introduce a meta critic network to evaluate the quality of attention maps in the main network.
arXiv Detail & Related papers (2020-07-13T02:44:38Z)