Related papers: Pay Attention to What and Where? Interpretable Feature Extractor in Vision-based Deep Reinforcement Learning

URL: http://arxiv.org/abs/2504.10071v1
Date: Mon, 14 Apr 2025 10:18:34 GMT
Title: Pay Attention to What and Where? Interpretable Feature Extractor in Vision-based Deep Reinforcement Learning
Authors: Tien Pham, Angelo Cangelosi
Abstract summary: Current approaches in Explainable Deep Reinforcement Learning have limitations in which the attention mask has a displacement with the objects in visual input. We propose the Interpretable Feature Extractor architecture, aimed at generating an accurate attention mask to illustrate both "what" and "where" the agent concentrates on in the spatial domain. The resulting attention mask is consistent, highly understandable by humans, accurate in spatial dimension, and effectively highlights important objects or locations in visual input.
Score: 2.713322720372114
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current approaches in Explainable Deep Reinforcement Learning have limitations in which the attention mask has a displacement with the objects in visual input. This work addresses a spatial problem within traditional Convolutional Neural Networks (CNNs). We propose the Interpretable Feature Extractor (IFE) architecture, aimed at generating an accurate attention mask to illustrate both "what" and "where" the agent concentrates on in the spatial domain. Our design incorporates a Human-Understandable Encoding module to generate a fully interpretable attention mask, followed by an Agent-Friendly Encoding module to enhance the agent's learning efficiency. These two components together form the Interpretable Feature Extractor for vision-based deep reinforcement learning to enable the model's interpretability. The resulting attention mask is consistent, highly understandable by humans, accurate in spatial dimension, and effectively highlights important objects or locations in visual input. The Interpretable Feature Extractor is integrated into the Fast and Data-efficient Rainbow framework, and evaluated on 57 ATARI games to show the effectiveness of the proposed approach on Spatial Preservation, Interpretability, and Data-efficiency. Finally, we showcase the versatility of our approach by incorporating the IFE into the Asynchronous Advantage Actor-Critic Model.

Related papers

Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning [33.76961965760301]
We propose a novel method called Layer-by-Layer Hierarchical Attention Network. It enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network's representation capability.
arXiv Detail & Related papers (2025-12-31T04:25:53Z)
ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning [53.19029595226767]
Slot attention-based framework has emerged as a leading approach in object-centric learning. Current methods require a stable feature space throughout training to enable reconstruction from slots. We propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models.
arXiv Detail & Related papers (2025-09-02T07:19:25Z)
Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention [0.19116784879310025]
We present a self-supervised learning framework for recognizing handwritten mathematical expressions (HMER). Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy.
arXiv Detail & Related papers (2025-08-08T08:11:36Z)
Top-Down Compression: Revisit Efficient Vision Token Projection for Visual Instruction Tuning [70.57180215148125]
Visual instruction tuning aims to enable large language models to comprehend the visual world. Existing methods often grapple with the intractable trade-off between accuracy and efficiency. We present LLaVA-Meteor, a novel approach that strategically compresses visual tokens without compromising core information.
arXiv Detail & Related papers (2025-05-17T10:22:29Z)
Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
"Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z)
Point Cloud Understanding via Attention-Driven Contrastive Learning [64.65145700121442]
Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms. PointACL is an attention-driven contrastive learning framework designed to address these limitations. Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions.
arXiv Detail & Related papers (2024-11-22T05:41:00Z)
Self-supervised Learning via Cluster Distance Prediction for Operating Room Context Awareness [44.15562068190958]
In the Operating Room, semantic segmentation is at the core of creating robots aware of clinical surroundings. State-of-the-art semantic segmentation and activity recognition approaches are fully supervised, which is not scalable. We propose a new 3D self-supervised task for OR scene understanding utilizing OR scene images captured with ToF cameras.
arXiv Detail & Related papers (2024-07-07T17:17:52Z)
Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language model to improve zero-shot HOI detection. We develop an effective additive self-attention mechanism to generate more comprehensive visual representations. Our model outperforms the previous methods in various zero-shot and full-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
Monocular Per-Object Distance Estimation with Masked Object Modeling [33.59920084936913]
Our paper draws inspiration from Masked Image Modeling (MiM) and extends it to multi-object tasks. Our strategy, termed Masked Object Modeling (MoM), enables a novel application of masking techniques. We evaluate the effectiveness of MoM on a novel reference architecture (DistFormer) on the standard KITTI, NuScenes, and MOT Synth datasets.
arXiv Detail & Related papers (2024-01-06T10:56:36Z)
TMHOI: Translational Model for Human-Object Interaction Detection [18.804647133922195]
We propose an innovative graph-based approach to detect human-object interactions (HOIs). Our method effectively captures the sentiment representation of HOIs by integrating both spatial and semantic knowledge. Our approach outperformed existing state-of-the-art graph-based methods by a significant margin.
arXiv Detail & Related papers (2023-03-07T21:52:10Z)
AASeg: Attention Aware Network for Real Time Semantic Segmentation [0.0]
We propose AASeg, a novel Attention-Aware Network for real-time semantic segmentation. We show that AASeg achieves a compelling trade-off between accuracy and efficiency, outperforming prior real-time methods.
arXiv Detail & Related papers (2021-07-27T20:01:55Z)
Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation [87.1188556802942]
We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. We propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions. Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain.
arXiv Detail & Related papers (2021-05-17T13:42:09Z)
MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation [4.127128889779478]
This work focuses on performing better or comparable to the existing learning-based solutions for visual navigation for autonomous agents. We propose a method to encode vital scene semantics into a semantically informed, top-down egocentric map representation. We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-03-21T12:01:23Z)
Variational Structured Attention Networks for Deep Visual Representation Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner. Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework. We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
arXiv Detail & Related papers (2021-03-05T07:37:24Z)
Heterogeneous Contrastive Learning: Encoding Spatial Information for Compact Visual Representations [183.03278932562438]
This paper presents an effective approach that adds spatial information to the encoding stage to alleviate the learning inconsistency between the contrastive objective and strong data augmentation operations. We show that our approach achieves higher efficiency in visual representations and thus delivers a key message to inspire the future research of self-supervised visual representation learning.
arXiv Detail & Related papers (2020-11-19T16:26:25Z)
Object-Centric Learning with Slot Attention [43.684193749891506]
We present the Slot Attention module, an architectural component that interfaces with perceptual representations. Slot Attention produces task-dependent abstract representations which we call slots. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions.
arXiv Detail & Related papers (2020-06-26T15:31:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.