GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers
- URL: http://arxiv.org/abs/2411.15605v2
- Date: Mon, 03 Feb 2025 14:36:25 GMT
- Title: GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers
- Authors: Éloi Zablocki, Valentin Gerard, Amaia Cardiel, Eric Gaussier, Matthieu Cord, Eduardo Valle,
- Abstract summary: GIFT is a framework for deriving post-hoc, global, interpretable, and faithful textual explanations for vision classifiers. We show that GIFT effectively reveals meaningful insights, uncovering tasks, concepts, and biases used by deep vision classifiers.
- Score: 47.197438852068146
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding deep models is crucial for deploying them in safety-critical applications. We introduce GIFT, a framework for deriving post-hoc, global, interpretable, and faithful textual explanations for vision classifiers. GIFT starts from local faithful visual counterfactual explanations and employs (vision) language models to translate those into global textual explanations. Crucially, GIFT provides a verification stage measuring the causal effect of the proposed explanations on the classifier decision. Through experiments across diverse datasets, including CLEVR, CelebA, and BDD, we demonstrate that GIFT effectively reveals meaningful insights, uncovering tasks, concepts, and biases used by deep vision classifiers. The framework is released at https://github.com/valeoai/GIFT.
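The abstract describes a three-stage pipeline: local visual counterfactuals, translation into a global textual explanation, and causal verification. A minimal sketch of that flow, using a toy classifier and placeholder stage implementations (all names and signatures here are hypothetical, not from the released valeoai/GIFT code):

```python
def toy_classifier(image, concept=None):
    """Toy stand-in for a vision classifier: predicts 1 iff the image is
    'bright'; imposing the brightness concept forces brightness."""
    bright = image["bright"] or concept == "brightness"
    return int(bright)

def local_counterfactuals(classifier, images):
    # Stage 1: per-image faithful visual counterfactuals (here, the single
    # edit 'brightness' flips every dark image's prediction).
    return [{"image": img, "edit": "brightness"}
            for img in images if classifier(img) == 0]

def translate_to_text(counterfactuals):
    # Stage 2: a (vision-)language model would summarize the local edits
    # into a global textual explanation; here we just name the common edit.
    edits = {cf["edit"] for cf in counterfactuals}
    return f"the classifier relies on {', '.join(sorted(edits))}"

def causal_effect(classifier, concept, images):
    # Stage 3 (verification): fraction of decisions that change when the
    # concept named by the explanation is imposed on the input.
    flips = sum(classifier(img, concept=concept) != classifier(img)
                for img in images)
    return flips / len(images)

images = [{"bright": False}, {"bright": True}, {"bright": False}]
cfs = local_counterfactuals(toy_classifier, images)
explanation = translate_to_text(cfs)
effect = causal_effect(toy_classifier, "brightness", images)
```

In this toy setup, the explanation is "the classifier relies on brightness", and the measured causal effect is 2/3 (the two dark images flip). The verification stage is what makes the textual explanation falsifiable rather than merely plausible.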
Related papers
- ContextualSHAP: Enhancing SHAP Explanations Through Contextual Language Generation [0.0]
We propose a Python package that extends SHAP by integrating it with a large language model (LLM). This integration is guided by user-defined parameters that tailor the explanation to both the model context and the user perspective. In our evaluation, the generated explanations were perceived as more understandable and contextually appropriate than visual-only outputs.
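The core idea, SHAP attributions rendered into an LLM prompt shaped by user parameters, can be sketched as follows (the function name, the `audience` parameter, and the feature values are illustrative assumptions, not the package's actual API):

```python
def shap_to_prompt(shap_values, audience="non-expert", max_features=5):
    """Format per-feature SHAP attributions into an LLM prompt.

    `shap_values` maps feature names to attribution scores; `audience`
    stands in for the user-defined parameters the paper mentions.
    """
    # Rank features by magnitude of contribution, most influential first.
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    bullets = "\n".join(f"- {name}: {score:+.2f}"
                        for name, score in ranked[:max_features])
    return (f"Explain to a {audience} why the model made this prediction, "
            f"given these feature contributions:\n{bullets}")

prompt = shap_to_prompt({"income": 0.42, "age": -0.13, "zip_code": 0.05})
```

The resulting prompt would then be sent to an LLM of choice; the value added over raw SHAP plots is that the ranking and the audience framing are already encoded in the text.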
arXiv Detail & Related papers (2025-12-08T05:18:15Z)
- DEXTER: Diffusion-Guided EXplanations with TExtual Reasoning for Vision Models [49.25757423776323]
DEXTER is a data-free framework that generates global, textual explanations of visual classifiers. We show that DEXTER produces accurate, interpretable outputs. Experiments on ImageNet, Waterbirds, CelebA, and FairFaces confirm that DEXTER outperforms existing approaches in global model explanation and class-level bias reporting.
arXiv Detail & Related papers (2025-10-16T14:43:25Z)
- G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation [48.23263809469786]
We propose a framework using graph retrieval-augmented large language models (LLMs) for explainable recommendation.
G-Refer achieves superior performance compared with existing methods in both explainability and stability.
arXiv Detail & Related papers (2025-02-18T06:42:38Z)
- MEGL: Multimodal Explanation-Guided Learning [23.54169888224728]
We propose a novel Multimodal Explanation-Guided Learning (MEGL) framework to enhance model interpretability and improve classification performance.
Our Saliency-Driven Textual Grounding (SDTG) approach integrates spatial information from visual explanations into textual rationales, providing spatially grounded and contextually rich explanations.
We validate MEGL on two new datasets, Object-ME and Action-ME, for image classification with multimodal explanations.
arXiv Detail & Related papers (2024-11-20T05:57:00Z)
- Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z)
- Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that, leveraging language-vision models, enables multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z)
- SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection [18.356648843815627]
Out-of-context (OOC) misinformation is one of the easiest and most effective ways to mislead audiences.
Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments.
We introduce SNIFFER, a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation.
arXiv Detail & Related papers (2024-03-05T18:04:59Z)
- General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z)
- GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification [8.880856137902947]
We introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner.
GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations.
For proper evaluation, we also propose two novel document-level downstream VDU tasks: Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR).
arXiv Detail & Related papers (2023-09-11T18:35:14Z)
- GLOBE-CE: A Translation-Based Approach for Global Counterfactual Explanations [10.276136171459731]
Global & Efficient Counterfactual Explanations (GLOBE-CE) is a flexible framework that tackles the reliability and scalability issues associated with current state-of-the-art methods.
We provide a unique mathematical analysis of categorical feature translations, utilising it in our method.
Experimental evaluation with publicly available datasets and user studies demonstrate that GLOBE-CE performs significantly better than the current state-of-the-art.
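The translation-based idea, one shared direction in feature space, scaled per instance, serving as a global counterfactual explanation, can be sketched on a toy linear classifier (the classifier, points, and scale grid below are hypothetical, not from the GLOBE-CE paper):

```python
def toy_predict(x):
    # Hypothetical linear classifier: class 1 iff x1 + x2 > 1.
    return int(x[0] + x[1] > 1.0)

def globe_ce_translation(points, delta, scales=(0.5, 1.0, 2.0)):
    """Sketch of the GLOBE-CE idea: a single translation direction `delta`,
    scaled per instance, acts as one global counterfactual explanation.
    Returns each point paired with the smallest scale that flips its class."""
    results = []
    for x in points:
        for k in scales:
            candidate = (x[0] + k * delta[0], x[1] + k * delta[1])
            if toy_predict(candidate) != toy_predict(x):
                results.append((x, k))
                break
    return results

flips = globe_ce_translation([(0.1, 0.1), (0.4, 0.4)], delta=(0.5, 0.5))
```

Because the direction `delta` is shared across all instances, a single statement ("increase these features") explains the whole class boundary, which is what makes the explanation global rather than per-instance.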
arXiv Detail & Related papers (2023-05-26T15:26:59Z)
- LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation [34.40862385518366]
Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the dataset long-tail problem.
We propose LANDMARK (LANguage-guiDed representAtion enhanceMent frAmewoRK), which learns predicate-relevant representations from language-vision interactive patterns.
This framework is model-agnostic and consistently improves performance on existing SGG models.
arXiv Detail & Related papers (2023-03-02T09:03:11Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments; that human-generated references could vary drastically in different datasets/tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.