LICO: Explainable Models with Language-Image Consistency
- URL: http://arxiv.org/abs/2310.09821v1
- Date: Sun, 15 Oct 2023 12:44:33 GMT
- Title: LICO: Explainable Models with Language-Image Consistency
- Authors: Yiming Lei, Zilong Li, Yangyang Li, Junping Zhang, Hongming Shan
- Abstract summary: This paper develops a Language-Image COnsistency model for explainable image classification, termed LICO.
We first establish a coarse global manifold structure alignment by minimizing the distance between the distributions of image and language features.
We then achieve fine-grained saliency maps by applying optimal transport (OT) theory to assign local feature maps with class-specific prompts.
- Score: 39.869639626266554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interpreting the decisions of deep learning models has been actively studied
since the explosion of deep neural networks. One of the most convincing
interpretation approaches is salience-based visual interpretation, such as
Grad-CAM, where the generation of attention maps depends merely on categorical
labels. Although existing interpretation methods can provide explainable
decision clues, they often yield partial correspondence between image and
saliency maps due to the limited discriminative information from one-hot
labels. This paper develops a Language-Image COnsistency model for explainable
image classification, termed LICO, by correlating learnable linguistic prompts
with corresponding visual features in a coarse-to-fine manner. Specifically, we
first establish a coarse global manifold structure alignment by minimizing the
distance between the distributions of image and language features. We then
achieve fine-grained saliency maps by applying optimal transport (OT) theory to
assign local feature maps with class-specific prompts. Extensive experimental
results on eight benchmark datasets demonstrate that the proposed LICO achieves
a significant improvement in generating more explainable attention maps in
conjunction with existing interpretation methods such as Grad-CAM. Remarkably,
LICO improves the classification performance of existing models without
introducing any computational overhead during inference. Source code is made
available at https://github.com/ymLeiFDU/LICO.
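The fine-grained alignment step described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: it uses entropic-regularized optimal transport solved with Sinkhorn iterations (a standard choice for OT assignment problems) to match local image feature vectors to prompt token features. All shapes, names, and the cosine-distance cost are assumptions for the sketch.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic-regularized OT between uniform marginals via Sinkhorn iterations."""
    n, m = cost.shape
    K = np.exp(-cost / eps)        # Gibbs kernel from the cost matrix
    a = np.ones(n) / n             # uniform marginal over feature locations
    b = np.ones(m) / m             # uniform marginal over prompt tokens
    u = np.ones(n) / n
    v = np.ones(m) / m
    for _ in range(n_iters):       # alternate scaling updates until convergence
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan, shape (n, m)

rng = np.random.default_rng(0)
F = rng.normal(size=(49, 64))   # e.g. 7x7 grid of 64-d local image features
G = rng.normal(size=(16, 64))   # e.g. 16 learnable prompt token features
# cosine-distance cost between every feature location and every prompt token
Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
cost = 1.0 - Fn @ Gn.T
P = sinkhorn(cost)
ot_distance = float((P * cost).sum())   # candidate fine-grained alignment loss term
```

Minimizing such an OT distance during training would encourage each spatial feature location to align with a class-specific prompt token, which is the intuition behind the fine-grained saliency maps; the coarse stage would similarly minimize a distance between the global distributions of image and language features.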
Related papers
- Interpretable Network Visualizations: A Human-in-the-Loop Approach for Post-hoc Explainability of CNN-based Image Classification [5.087579454836169]
State-of-the-art explainability methods generate saliency maps to show where a specific class is identified.
We introduce a post-hoc method that explains the entire feature extraction process of a Convolutional Neural Network.
We also show an approach to generate global explanations by aggregating labels across multiple images.
arXiv Detail & Related papers (2024-05-06T09:21:35Z)
- Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues [55.97779732051921]
A new learning strategy is proposed to explicitly incorporate AU (action unit) cues into classifier training.
We show that our strategy can improve layer-wise interpretability without degrading classification performance.
arXiv Detail & Related papers (2024-02-01T02:13:49Z)
- EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
- Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
arXiv Detail & Related papers (2024-01-18T10:55:13Z)
- Feature Activation Map: Visual Explanation of Deep Learning Models for Image Classification [17.373054348176932]
In this work, a post-hoc interpretation tool named feature activation map (FAM) is proposed.
FAM can interpret deep learning models that do not use a fully connected (FC) layer as the classifier.
Experiments conducted on ten deep learning models for few-shot image classification, contrastive learning image classification and image retrieval tasks demonstrate the effectiveness of the proposed FAM algorithm.
arXiv Detail & Related papers (2023-07-11T05:33:46Z)
- Decom-CAM: Tell Me What You See, In Details! Feature-Level Interpretation via Decomposition Class Activation Map [23.71680014689873]
Class Activation Map (CAM) is widely used to interpret deep model predictions by highlighting object location.
This paper proposes a new two-stage interpretability method called the Decomposition Class Activation Map (Decom-CAM).
Our experiments demonstrate that the proposed Decom-CAM outperforms current state-of-the-art methods significantly.
arXiv Detail & Related papers (2023-05-27T14:33:01Z)
- From Heatmaps to Structural Explanations of Image Classifiers [31.44267537307587]
The paper begins by describing the explainable neural network (XNN), which attempts to extract and visualize several high-level concepts purely from the deep network.
Realizing that an important missing piece is a reliable heatmap visualization tool, we have developed I-GOS and iGOS++.
Through the research process, we have gained many insights into building deep network explanations.
arXiv Detail & Related papers (2021-09-13T23:39:57Z)
- Spatial-spectral Hyperspectral Image Classification via Multiple Random Anchor Graphs Ensemble Learning [88.60285937702304]
This paper proposes a novel spatial-spectral HSI classification method via multiple random anchor graphs ensemble learning (RAGE).
Firstly, the local binary pattern is adopted to extract more descriptive features on each selected band, preserving local structures and subtle changes of a region.
Secondly, adaptive neighbor assignment is introduced in the construction of the anchor graphs to reduce computational complexity.
arXiv Detail & Related papers (2021-03-25T09:31:41Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Region Comparison Network for Interpretable Few-shot Image Classification [97.97902360117368]
Few-shot image classification has been proposed to effectively use only a limited number of labeled examples to train models for new classes.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
arXiv Detail & Related papers (2020-09-08T07:29:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.