Related papers: TextCAM: Explaining Class Activation Map with Text

URL: http://arxiv.org/abs/2510.01004v1
Date: Wed, 01 Oct 2025 15:11:14 GMT
Title: TextCAM: Explaining Class Activation Map with Text
Authors: Qiming Zhao, Xingjian Li, Xiaoyu Cao, Xiaolong Wu, Min Xu
Abstract summary: This paper focuses on deep vision models, for which a dominant line of explainability methods is Class Activation Mapping (CAM) and its variants. We propose TextCAM, a novel explanation framework that enriches CAM with natural language. We derive channel-level semantic representations using CLIP embeddings and linear discriminant analysis, and aggregate them with CAM weights. This yields explanations that jointly specify where the model attends and what visual attributes likely support its decision.
Score: 24.927593721256077
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep neural networks (DNNs) have achieved remarkable success across domains but remain difficult to interpret, limiting their trustworthiness in high-stakes applications. This paper focuses on deep vision models, for which a dominant line of explainability methods is Class Activation Mapping (CAM) and its variants, which work by highlighting the spatial regions that drive predictions. We observe that CAM provides little semantic insight into what attributes underlie these activations. To address this limitation, we propose TextCAM, a novel explanation framework that enriches CAM with natural language. TextCAM combines the precise spatial localization of CAM with the semantic alignment of vision-language models (VLMs). Specifically, we derive channel-level semantic representations using CLIP embeddings and linear discriminant analysis, and aggregate them with CAM weights to produce textual descriptions of salient visual evidence. This yields explanations that jointly specify where the model attends and what visual attributes likely support its decision. We further extend TextCAM to group feature channels into semantically coherent groups, enabling more fine-grained visual-textual explanations. Experiments on ImageNet, CLEVR, and CUB demonstrate that TextCAM produces faithful and interpretable rationales that improve human understanding, detect spurious correlations, and preserve model fidelity.
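
The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_phrases`, the clipping of negative CAM weights, the cosine-similarity ranking, and the toy one-hot vectors standing in for CLIP embeddings are all assumptions made for illustration. The step that derives per-channel embeddings (CLIP plus linear discriminant analysis in the paper) is not shown; the sketch assumes those embeddings already exist.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def rank_phrases(channel_embeds, cam_weights, text_embeds, phrases, top_k=2):
    """Rank candidate phrases against a CAM-weighted mix of channel embeddings.

    channel_embeds: (C, D) one semantic vector per feature channel (assumed given).
    cam_weights:    (C,)   per-channel CAM weights for the predicted class.
    text_embeds:    (T, D) embeddings of the candidate phrases.
    """
    # Assumption: keep only channels with positive evidence, then normalize.
    weights = np.maximum(cam_weights, 0.0)
    total = weights.sum()
    if total > 0:
        weights = weights / total
    # Weighted sum of channel embeddings gives one vector for the explanation.
    aggregated = l2_normalize(weights @ channel_embeds)
    # Cosine similarity between each phrase and the aggregated vector.
    scores = l2_normalize(text_embeds) @ aggregated
    order = np.argsort(-scores)[:top_k]
    return [(phrases[i], float(scores[i])) for i in order]


# Toy data: one-hot vectors stand in for real CLIP embeddings.
phrases = ["striped fur", "blue sky", "round wheel"]
text_embeds = np.eye(3)
channel_embeds = np.array(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
)
cam_weights = np.array([0.6, 0.1, -0.5, 0.3])

result = rank_phrases(channel_embeds, cam_weights, text_embeds, phrases)
print(result)
```

In this toy setup the two channels aligned with "striped fur" carry most of the positive CAM weight, so that phrase ranks first; the channel with a negative weight contributes nothing.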

Related papers

Integrative CAM: Adaptive Layer Fusion for Comprehensive Interpretation of CNNs [2.58561853556421]
Integrative CAM provides a holistic view of feature importance across Convolutional Neural Networks (CNNs). Traditional gradient-based CAM methods, such as Grad-CAM and Grad-CAM++, primarily use final layer activations to highlight regions of interest. We generalize the alpha term from Grad-CAM++ to apply to any smooth function, expanding CAM applicability across a wider range of models.
arXiv Detail & Related papers (2024-12-02T10:33:34Z)
Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection. The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features. Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
DecomCAM: Advancing Beyond Saliency Maps through Decomposition and Integration [25.299607743268993]
Class Activation Map (CAM) methods highlight regions revealing the model's decision-making basis but lack clear saliency maps and detailed interpretability. We propose DecomCAM, a novel decomposition-and-integration method that distills shared patterns from channel activation maps. Experiments reveal that DecomCAM not only excels in localization accuracy but also achieves an optimized balance between interpretability and computational efficiency.
arXiv Detail & Related papers (2024-05-29T08:40:11Z)
CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations. CLIM consistently improves different open-vocabulary object detection methods. It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP. We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
Exploit CAM by itself: Complementary Learning System for Weakly Supervised Semantic Segmentation [59.24824050194334]
This paper turns to an interesting working mechanism in agent learning named Complementary Learning System (CLS). Motivated by this simple but effective learning pattern, we propose a General-Specific Learning Mechanism (GSLM). GSLM develops a General Learning Module (GLM) and a Specific Learning Module (SLM).
arXiv Detail & Related papers (2023-03-04T16:16:47Z)
VS-CAM: Vertex Semantic Class Activation Mapping to Interpret Vision Graph Neural Network [10.365366151667017]
Graph convolutional neural networks (GCNs) have drawn increasing attention and attained good performance in various computer vision tasks. For standard convolutional neural networks (CNNs), class activation mapping (CAM) methods are commonly used to visualize the connection between a CNN's decision and image regions by generating a heatmap. In this paper, we propose a novel visualization method particularly applicable to GCNs, Vertex Semantic Class Activation Mapping (VS-CAM).
arXiv Detail & Related papers (2022-09-15T09:45:59Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks [89.56292219019163]
Explanation methods facilitate the development of models that learn meaningful concepts and avoid exploiting spurious correlations. We illustrate a previously unrecognized limitation of the popular neural network explanation method Grad-CAM. We propose HiResCAM, a class-specific explanation method that is guaranteed to highlight only the locations the model used to make each prediction.
arXiv Detail & Related papers (2020-11-17T19:26:14Z)
Eigen-CAM: Class Activation Map using Principal Components [1.2691047660244335]
This paper builds on previous ideas to cope with the increasing demand for interpretable, robust, and transparent models. The proposed Eigen-CAM computes and visualizes the principal components of the learned features/representations from the convolutional layers.
arXiv Detail & Related papers (2020-08-01T17:14:13Z)
SS-CAM: Smoothed Score-CAM for Sharper Visual Feature Localization [1.3381749415517021]
We introduce SS-CAM, an enhanced visual explanation method that improves visual sharpness. On the ILSVRC 2012 validation dataset, our method outperforms Score-CAM on both faithfulness and localization tasks.
arXiv Detail & Related papers (2020-06-25T08:51:54Z)
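
As an illustration of the Eigen-CAM entry above, the idea of visualizing the first principal component of convolutional features can be sketched with a singular value decomposition. This is a rough sketch under stated assumptions, not the paper's reference code: the function name `eigen_cam` is hypothetical, the activations are not mean-centered, and the sign flip and the final rescaling to [0, 1] are conveniences added here.

```python
import numpy as np


def eigen_cam(activations):
    """Project a (C, H, W) activation tensor onto its first principal direction."""
    c, h, w = activations.shape
    # One row per spatial location, one column per channel: shape (H*W, C).
    flat = activations.reshape(c, h * w).T
    # Rows of vt are the right singular vectors (directions in channel space).
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    projection = flat @ vt[0]
    # SVD signs are arbitrary; assumption: orient the map so its mass is positive.
    if projection.sum() < 0:
        projection = -projection
    heat = np.maximum(projection, 0.0).reshape(h, w)
    peak = heat.max()
    return heat / peak if peak > 0 else heat


# Toy activations: two channels that both fire on the same 2x2 patch.
acts = np.zeros((2, 4, 4))
acts[0, 1:3, 1:3] = 1.0
acts[1, 1:3, 1:3] = 0.5
heat = eigen_cam(acts)
print(heat)
```

No class label or gradient enters the computation, which is why a map of this kind is class-agnostic.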

This list is automatically generated from the titles and abstracts of the papers on this site.