GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models
- URL: http://arxiv.org/abs/2506.18985v3
- Date: Tue, 29 Jul 2025 17:59:59 GMT
- Title: GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models
- Authors: Guanxi Shen
- Abstract summary: We introduce GLIMPSE, a model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, yet is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and pushing the state-of-the-art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
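The fusion described in the abstract can be pictured with a short sketch. The snippet below is a minimal illustration of gradient-weighted attention aggregation into a response-level heat map, not the authors' reference implementation: the tensor shapes, the `token_weights` input, and the equal-weight sum across layers (standing in for the paper's adaptive layer propagation) are all assumptions made for the example.

```python
# Minimal sketch, assuming synthetic attention tensors in place of a real LVLM.
import torch

def relevance_heatmap(attentions, gradients, token_weights, num_patches):
    """Fuse per-layer attention and its gradients into a response-level map.

    attentions, gradients: lists of [heads, gen_tokens, image_tokens] tensors,
        one per decoder layer (gradients taken w.r.t. the attention weights
        for the generated tokens' log-probabilities).
    token_weights: [gen_tokens] relevance weight per generated token.
    num_patches: side length of the square image-patch grid.
    """
    fused = torch.zeros(attentions[0].shape[1:])            # [gen_tokens, image_tokens]
    for attn, grad in zip(attentions, gradients):
        # Gradient-weighted attention: keep only positive gradient-attention products.
        layer_map = torch.relu(grad * attn).mean(dim=0)      # average over heads
        fused = fused + layer_map                            # equal-weight sum across layers
    # Relevance-weighted aggregation over generated tokens -> one holistic map.
    weights = token_weights / token_weights.sum()
    heatmap = (weights[:, None] * fused).sum(dim=0)          # [image_tokens]
    return heatmap.reshape(num_patches, num_patches)

# Toy usage with random tensors standing in for a real model's attentions.
layers, heads, gen_tokens, patches = 4, 8, 5, 24
attn = [torch.rand(heads, gen_tokens, patches * patches) for _ in range(layers)]
grad = [torch.randn(heads, gen_tokens, patches * patches) for _ in range(layers)]
heat = relevance_heatmap(attn, grad, torch.rand(gen_tokens), patches)
print(heat.shape)  # torch.Size([24, 24])
```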
Related papers
- CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
Large Vision-Language Models (LVLMs) frequently produce content that deviates from visual information, leading to object hallucination. We propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method.
arXiv Detail & Related papers (2025-06-30T07:52:36Z) - SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding [5.976839106353883]
SECOND: Selective and Contrastive Decoding is a novel approach that enables Vision-Language Models to leverage multi-scale visual information in an object-centric manner. SECOND significantly reduces perceptual hallucinations and outperforms prior methods across a wide range of benchmarks.
arXiv Detail & Related papers (2025-06-10T02:55:38Z) - Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models [11.385588803559733]
We enhance the model's visual understanding by leveraging the core information embedded in semantic representations. We evaluate our method on three image captioning benchmarks using five different LVLMs, demonstrating its effectiveness in significantly reducing hallucinations.
arXiv Detail & Related papers (2025-05-20T12:10:13Z) - Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis [21.869968563545736]
We define implicit visual misunderstanding (IVM), where MLLMs provide correct answers without fully comprehending the visual input. We introduce a scale-agnostic metric, attention accuracy, and a novel benchmark for quantifying IVMs. We extend our approach to finer granularities and demonstrate its effectiveness in unimodal scenarios.
arXiv Detail & Related papers (2025-05-15T17:52:40Z) - VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation [3.695317701129061]
We introduce Vision and Language Embeddings for Explainable WSI Representation (VLEER), a novel method designed to leverage vision features for WSI representation. VLEER offers the unique advantage of interpretability, enabling direct human-readable insights into the results.
arXiv Detail & Related papers (2025-02-28T08:49:03Z) - Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models [66.71616369573715]
Large Vision-Language Models (LVLMs) are prone to generating hallucinatory text responses that do not align with the given visual input. We introduce Self-Correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process.
arXiv Detail & Related papers (2025-02-10T03:43:55Z) - Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z) - Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge [24.538839144639653]
Large Vision-Language Models (LVLMs) integrate separately pre-trained vision and language components.
These models frequently encounter a core issue of "cognitive misalignment" between the vision encoder (VE) and the large language model (LLM).
arXiv Detail & Related papers (2024-11-25T18:33:14Z) - From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks [33.476693301050275]
We conduct experiments with truncation strategies across various LVLMs for visual question answering and image captioning tasks.
By exploring the information flow from the perspective of visual representation contribution, we observe that it tends to converge in shallow layers but diversify in deeper layers.
arXiv Detail & Related papers (2024-06-04T13:52:54Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept. We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z) - Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
arXiv Detail & Related papers (2024-02-13T18:59:05Z) - Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)