Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
- URL: http://arxiv.org/abs/2508.05430v1
- Date: Thu, 07 Aug 2025 14:18:56 GMT
- Title: Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
- Authors: Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek
- Abstract summary: Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders.
- Score: 25.897711293173362
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
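The core quantity behind FIxLIP can be illustrated with a short sketch. The weighted Banzhaf interaction index of a player pair (i, j) for a set function v is the expected discrete mixed difference v(S ∪ {i,j}) − v(S ∪ {i}) − v(S ∪ {j}) + v(S), where the coalition S includes each remaining player independently with probability p (p = 0.5 recovers the classical Banzhaf index). The Monte Carlo estimator below is a minimal sketch of this definition, not the paper's own estimator; `value_fn` is a hypothetical callable returning the encoder's image-text similarity with players outside the coalition masked out.

```python
import numpy as np

def weighted_banzhaf_interaction(value_fn, n_players, i, j, p=0.5,
                                 n_samples=256, seed=0):
    """Monte Carlo estimate of the weighted Banzhaf interaction index for the
    player pair (i, j) of a set function value_fn.

    value_fn: maps a boolean mask of shape (n_players,) to a scalar, e.g. the
              image-text similarity with masked-out patches/tokens removed.
    p:        coalition-sampling weight; p = 0.5 recovers the classical
              (unweighted) Banzhaf interaction index.
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_samples)
    for t in range(n_samples):
        # Draw a random coalition S over all players except i and j:
        # each player joins independently with probability p.
        mask = rng.random(n_players) < p
        mask[[i, j]] = False
        # Discrete mixed difference:
        # v(S u {i,j}) - v(S u {i}) - v(S u {j}) + v(S).
        m_ij = mask.copy()
        m_ij[[i, j]] = True
        m_i = mask.copy()
        m_i[i] = True
        m_j = mask.copy()
        m_j[j] = True
        estimates[t] = (value_fn(m_ij) - value_fn(m_i)
                        - value_fn(m_j) + value_fn(mask))
    return float(estimates.mean())
```

In the vision-language setting the players would be image patches together with text tokens, so pairs with one player from each modality quantify cross-modal interactions. Lowering p below 0.5 concentrates sampling on small coalitions, which is one concrete reading of the flexibility and efficiency the abstract attributes to the weighted index.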
Related papers
- Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration [42.24582981160835]
Open-vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders. We propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector.
arXiv Detail & Related papers (2025-08-05T08:33:58Z)
- Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model [56.573203512455706]
Large-scale vision-language models (VLMs) have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, their predictions are difficult to interpret. One approach to address this issue is to develop interpretable models by integrating language. We propose LaZSL, a locally-aligned vision-language model for interpretable ZSL.
arXiv Detail & Related papers (2025-06-30T13:14:46Z)
- DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining. It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z)
- Explaining Caption-Image Interactions in CLIP models with Second-Order Attributions [28.53636082915161]
CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their success, however, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods can only provide limited insights into dual-encoders.
arXiv Detail & Related papers (2024-08-26T09:55:34Z)
- Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
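For a dual encoder with similarity s = f(x)ᵀg(y), second-order methods of this kind decompose s into contributions of feature pairs across the two inputs. The sketch below shows the general shape of such a map via a Jacobian (Gradient × Input) factorization per embedding dimension; it is an illustrative stand-in, not BiLRP itself, which replaces the raw Jacobians with layer-wise relevance propagation through each encoder.

```python
import torch

def pairwise_similarity_map(f, g, x, y):
    """Second-order map R[i, j]: contribution of the input-feature pair
    (x_i, y_j) to the dot-product similarity s = f(x) @ g(y), factorized
    per embedding dimension via Gradient x Input.
    Assumes 1-D inputs and differentiable encoders f, g."""
    jf = torch.autograd.functional.jacobian(f, x)  # shape (d_embed, d_x)
    jg = torch.autograd.functional.jacobian(g, y)  # shape (d_embed, d_y)
    a = jf * x  # per-dimension contribution of each x_i, (d_embed, d_x)
    b = jg * y  # per-dimension contribution of each y_j, (d_embed, d_y)
    # R[i, j] = sum_k a[k, i] * b[k, j]
    return torch.einsum('ki,kj->ij', a, b)
```

For (bi)linear encoders the entries of R sum exactly to s; for deep encoders this conservation holds only approximately, which is one motivation for dedicated propagation rules such as BiLRP.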
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
- 3VL: Using Trees to Improve Vision-Language Models' Interpretability [40.678288227161936]
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. These representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
- LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
We also propose an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings.
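As a rough illustration of what such a fusion module could look like, here is a generic cross-attention block in which masked text token embeddings attend to the unmasked image embeddings; the wiring and layer sizes are assumptions made for the sketch, not LightCLIP's actual design.

```python
import torch
import torch.nn as nn

class ImageToTextFusion(nn.Module):
    """A generic cross-attention fusion block: masked text token embeddings
    attend to (unmasked) image embeddings. Illustrative only."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_embed: torch.Tensor):
        # text_tokens:  (batch, seq_len, dim), some positions masked out
        # image_embed:  (batch, n_patches, dim), from the unmasked image
        fused, _ = self.attn(query=text_tokens, key=image_embed,
                             value=image_embed)
        return self.norm(text_tokens + fused)  # residual + layer norm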
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvements.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance to single-stream methods while being 10,800x faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art performance on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Disentangled Representation Learning for Text-Video Retrieval [51.861423831566626]
Cross-modality interaction is a critical component in Text-Video Retrieval (TVR). We study the interaction paradigm in depth and find that its computation can be split into two terms.
We propose a disentangled framework to capture a sequential and hierarchical representation.
arXiv Detail & Related papers (2022-03-14T13:55:33Z)