What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
- URL: http://arxiv.org/abs/2406.16320v2
- Date: Fri, 18 Oct 2024 22:25:49 GMT
- Title: What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
- Authors: Michal Golovanevsky, William Rudman, Vedant Palit, Ritambhara Singh, Carsten Eickhoff
- Abstract summary: NOTICE is the first Noise-free Text-Image Corruption and Evaluation pipeline for mechanistic interpretability in Vision-Language Models (VLMs).
Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making.
This work paves the way for more transparent and interpretable multimodal systems.
- Score: 16.033361754660316
- Abstract: Vision-Language Models (VLMs) have gained community-spanning prominence due to their ability to integrate visual and textual inputs to perform complex tasks. Despite their success, the internal decision-making processes of these models remain opaque, posing challenges in high-stakes applications. To address this, we introduce NOTICE, the first Noise-free Text-Image Corruption and Evaluation pipeline for mechanistic interpretability in VLMs. NOTICE incorporates a Semantic Minimal Pairs (SMP) framework for image corruption and Symmetric Token Replacement (STR) for text. This approach enables semantically meaningful causal mediation analysis for both modalities, providing a robust method for analyzing multimodal integration within models like BLIP. Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making, identifying the significant role of middle-layer cross-attention heads. Further, we uncover a set of ``universal cross-attention heads'' that consistently contribute across tasks and modalities, each performing distinct functions such as implicit image segmentation, object inhibition, and outlier inhibition. This work paves the way for more transparent and interpretable multimodal systems.
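The abstract describes corrupting text via Symmetric Token Replacement (STR), swapping a content token for a semantically contrasting counterpart so that the corrupted input stays grammatical for causal mediation analysis. A minimal illustrative sketch of that idea is below; the pairing table and function names are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of Symmetric Token Replacement (STR)-style text
# corruption. The pairing table below is an illustrative assumption,
# not taken from the NOTICE paper.

SYMMETRIC_PAIRS = {
    # each token swaps with a contrasting counterpart of the same part of
    # speech, so sentence structure and token count are preserved
    "dog": "cat",
    "cat": "dog",
    "smiling": "frowning",
    "frowning": "smiling",
}

def symmetric_token_replacement(tokens):
    """Corrupt a tokenized caption by swapping each token that has a
    registered symmetric counterpart; all other tokens pass through."""
    return [SYMMETRIC_PAIRS.get(tok, tok) for tok in tokens]

clean = ["a", "dog", "is", "smiling"]
corrupt = symmetric_token_replacement(clean)
print(corrupt)  # ['a', 'cat', 'is', 'frowning']
```

In a mediation setup, the model would be run on both the clean and the corrupted caption, and individual activations (e.g. cross-attention heads) patched from the clean run into the corrupted run to measure each component's causal contribution.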
Related papers
- A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks [12.313257689227013]
This paper introduces Selective Feature Imputation for Debiasing (SFID), a novel methodology that integrates feature pruning and low confidence imputation.
SFID is versatile, maintains the semantic integrity of outputs, and is cost-effective, eliminating the need for retraining.
Our experimental results demonstrate SFID's effectiveness across various VLM tasks, including zero-shot classification, text-to-image retrieval, image captioning, and text-to-image generation.
arXiv Detail & Related papers (2024-10-10T03:57:48Z)
- Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities [18.859309032300402]
We investigate how the integration of information from image and text modalities influences the performance and behavior of Visual Language Model (VLM) predictions.
We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task.
Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence.
arXiv Detail & Related papers (2024-10-02T16:02:02Z)
- LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition [17.388776062997813]
We try to build discriminative global representations by fusing image data and text descriptions of the visual scene.
The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible manner in generating text descriptions of images.
Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging in efficient multi-modal fusion.
arXiv Detail & Related papers (2024-07-09T10:15:31Z)
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
Multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Enhancing Multimodal Entity and Relation Extraction with Variational Information Bottleneck [12.957002659910456]
We study multimodal named entity recognition (MNER) and multimodal relation extraction (MRE).
The core of MNER and MRE lies in incorporating evident visual information to enhance textual semantics.
We propose a novel method for MNER and MRE by Multi-Modal representation learning with Information Bottleneck (MMIB).
arXiv Detail & Related papers (2023-04-05T09:32:25Z)
- MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model [35.52349231889843]
We project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE).
Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information.
We propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM).
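The MAP blurb describes replacing deterministic embeddings with probability distributions. A common way to realize this, sketched below under assumed details (not the MAP implementation), is to have an encoder head predict a mean and a log-variance for a diagonal Gaussian and draw samples with the reparameterization trick.

```python
# Illustrative sketch of encoding a feature vector as a diagonal Gaussian
# distribution rather than a point embedding. Weight shapes and function
# names are assumptions for illustration, not the MAP paper's code.
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim = 8, 4
# two linear heads: one predicts the mean, one the log-variance
W_mu = rng.standard_normal((out_dim, in_dim))
W_logvar = rng.standard_normal((out_dim, in_dim))

def encode_as_distribution(x):
    """Map a deterministic feature x to (mu, sigma) of a diagonal Gaussian."""
    mu = W_mu @ x
    sigma = np.exp(0.5 * (W_logvar @ x))  # std from predicted log-variance
    return mu, sigma

def sample(mu, sigma):
    """Reparameterized draw: mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

x = rng.standard_normal(in_dim)
mu, sigma = encode_as_distribution(x)
z = sample(mu, sigma)
print(z.shape)  # (4,)
```

Sampled embeddings like `z` can then feed distribution-aware objectives such as the D-VLC contrastive loss named above, with the predicted variance conveying each representation's uncertainty.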
arXiv Detail & Related papers (2022-10-11T10:54:54Z)
- Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.