Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On
- URL: http://arxiv.org/abs/2503.06670v1
- Date: Sun, 09 Mar 2025 15:43:55 GMT
- Title: Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On
- Authors: Roni Goldshmidt
- Abstract summary: PixelSHAP is a framework extending Shapley-based analysis to structured visual entities. It applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpretability in Vision-Language Models (VLMs) is crucial for trust, debugging, and decision-making in high-stakes applications. We introduce PixelSHAP, a model-agnostic framework extending Shapley-based analysis to structured visual entities. Unlike previous methods focusing on text prompts, PixelSHAP applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. PixelSHAP requires no model internals, operating solely on input-output pairs, making it compatible with open-source and commercial models. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods. We validate PixelSHAP in autonomous driving, highlighting its ability to enhance interpretability. Key challenges include segmentation sensitivity and object occlusion. Our open-source implementation facilitates further research.
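The abstract describes the procedure only at a high level; a minimal sketch of the loop it implies (segment objects, perturb subsets, re-query the VLM, aggregate marginal contributions into Shapley values) might look like the following. The helper signatures `query_vlm` and `embed_text`, the blackout perturbation, and the permutation-sampling estimator are illustrative assumptions, not the authors' implementation.

```python
"""Monte Carlo Shapley attribution over segmented image objects.

A rough sketch of the idea described in the abstract: perturb subsets of
detected objects, query the VLM, and measure how much each object's
presence shifts the response embedding. All helper names are hypothetical.
"""
import random
from typing import Callable, List, Sequence

import numpy as np


def shapley_object_attribution(
    image: np.ndarray,
    object_masks: Sequence[np.ndarray],            # one boolean mask per detected object
    prompt: str,
    query_vlm: Callable[[np.ndarray, str], str],   # (image, prompt) -> answer text
    embed_text: Callable[[str], np.ndarray],       # answer text -> embedding vector
    n_samples: int = 64,
    seed: int = 0,
) -> List[float]:
    """Estimate one Shapley value per object via random permutations."""
    rng = random.Random(seed)
    n = len(object_masks)
    baseline = embed_text(query_vlm(image, prompt))   # response on the untouched image

    def masked_response(keep: Sequence[int]) -> np.ndarray:
        """Blank out every object NOT in `keep` and re-query the VLM."""
        perturbed = image.copy()
        for j in range(n):
            if j not in keep:
                perturbed[object_masks[j]] = 0        # simple blackout perturbation
        return embed_text(query_vlm(perturbed, prompt))

    def similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    values = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)
        kept: List[int] = []
        prev = similarity(baseline, masked_response(kept))
        for obj in order:
            kept.append(obj)
            curr = similarity(baseline, masked_response(kept))
            values[obj] += (curr - prev) / n_samples   # average marginal contribution
            prev = curr
    return values
```

Each sampled permutation costs one VLM call per object, which is why the abstract stresses optimization techniques for scaling; response caching and a smaller `n_samples` are the obvious levers in a sketch like this.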
Related papers
- Optimized Unet with Attention Mechanism for Multi-Scale Semantic Segmentation [8.443350618722564]
This paper proposes an improved Unet model combined with an attention mechanism.
It introduces channel attention and spatial attention modules, enhancing the model's ability to focus on important features (a rough sketch of such modules follows this entry).
The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively.
arXiv Detail & Related papers (2025-02-06T06:51:23Z)
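The summary above does not specify the module design; the following is a minimal, CBAM-style sketch of what channel and spatial attention blocks for such a Unet might look like. The pooling choices and layer sizes are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weight feature channels using pooled global statistics."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights


class SpatialAttention(nn.Module):
    """Re-weight spatial positions using channel-pooled maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        weights = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights
```

A decoder block could, for example, apply ChannelAttention followed by SpatialAttention to skip-connection features before concatenating them with the upsampled features.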
- PixelWorld: Towards Perceiving Everything as Pixels [50.13953243722129]
We propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e. "Perceive Everything as Pixels" (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all the mentioned modalities into pixel space to gauge the existing models' performance.
arXiv Detail & Related papers (2025-01-31T17:39:21Z)
- Integrating Object Detection Modality into Visual Language Model for Enhanced Autonomous Driving Agent [8.212818176634116]
We extend the Llama-Adapter architecture by incorporating a YOLOS-based detection network alongside the CLIP perception network.
Our approach introduces camera ID-separators to improve multi-view processing, which is crucial for comprehensive environmental awareness.
arXiv Detail & Related papers (2024-11-08T15:50:30Z)
- A Spitting Image: Modular Superpixel Tokenization in Vision Transformers [0.0]
Vision Transformer (ViT) architectures traditionally employ a grid-based approach to tokenization that is independent of the semantic content of an image.
We propose a modular superpixel tokenization strategy which decouples tokenization and feature extraction.
arXiv Detail & Related papers (2024-08-14T17:28:58Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Towards Training-free Open-world Segmentation via Image Prompt Foundation Models [13.720784509709496]
Image Prompt (IPSeg) is a training-free paradigm that capitalizes on image prompt techniques.
IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models.
Our approach extracts robust features for the prompt image and input image, then matches the input representations to the prompt representations via a novel feature interaction module.
arXiv Detail & Related papers (2023-10-17T01:12:08Z)
- Pixel-Inconsistency Modeling for Image Manipulation Localization [59.968362815126326]
Digital image forensics plays a crucial role in image authentication and manipulation localization.
This paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts.
Experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints.
arXiv Detail & Related papers (2023-09-30T02:54:51Z)
- Holistic Prototype Attention Network for Few-Shot VOS [74.25124421163542]
Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images.
We propose a holistic prototype attention network (HPAN) for advancing FSVOS.
arXiv Detail & Related papers (2023-07-16T03:48:57Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions (a rough sketch of this step follows the entry).
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
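The summary only names PCA as the localization step. Assuming a grid of per-patch self-supervised features is available, the projection-and-threshold idea can be sketched as follows; the mean threshold and the foreground-flip heuristic are illustrative choices, not the paper's procedure.

```python
import numpy as np


def localize_with_pca(features: np.ndarray) -> np.ndarray:
    """Localize a salient object from a grid of patch features.

    `features` has shape (H, W, D): one D-dimensional feature per patch.
    Patches are projected onto the first principal component and the
    projection is thresholded, assuming the foreground dominates the
    leading direction of variance.
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(np.float64)
    flat -= flat.mean(axis=0, keepdims=True)           # center the features
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    scores = flat @ vt[0]                               # projection onto first PC
    mask = (scores > scores.mean()).reshape(h, w)
    # Heuristic: if the "object" covers most of the image, flip the mask
    # and treat the smaller side of the threshold as the foreground.
    if mask.mean() > 0.5:
        mask = ~mask
    return mask
```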
- Switchable Representation Learning Framework with Self-compatibility [50.48336074436792]
We propose a Switchable representation learning Framework with Self-Compatibility (SFSC).
SFSC generates a series of compatible sub-models with different capacities through one training process.
SFSC achieves state-of-the-art performance on the evaluated datasets.
arXiv Detail & Related papers (2022-06-16T16:46:32Z)
- Combining Counterfactuals With Shapley Values To Explain Image Models [13.671174461441304]
We develop a pipeline to generate counterfactuals and estimate Shapley values.
We obtain contrastive and interpretable explanations with strong axiomatic guarantees (a rough sketch of the idea follows).
arXiv Detail & Related papers (2022-06-14T18:23:58Z)
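The entry above does not detail the pipeline; one hedged way to combine the two ingredients is to use a counterfactual image as the baseline that "absent" regions fall back to when computing Shapley values, as sketched below. The region granularity, the `score_fn` interface, and the exact-enumeration estimator are assumptions for illustration, and exhaustive enumeration is only feasible for a handful of regions.

```python
import itertools
from math import comb
from typing import Callable, List, Sequence

import numpy as np


def counterfactual_shapley(
    image: np.ndarray,
    counterfactual: np.ndarray,               # e.g. an inpainted or edited image
    region_masks: Sequence[np.ndarray],       # one boolean mask per region
    score_fn: Callable[[np.ndarray], float],  # model score for the class of interest
) -> List[float]:
    """Exact Shapley values over image regions, using a counterfactual
    image (rather than a blank canvas) as the 'absent feature' baseline."""
    n = len(region_masks)

    def composite(subset: frozenset) -> np.ndarray:
        out = counterfactual.copy()
        for i in subset:                       # regions inside the coalition keep
            out[region_masks[i]] = image[region_masks[i]]   # their original pixels
        return out

    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            weight = 1.0 / (n * comb(n - 1, r))    # Shapley weight for coalitions of size r
            for subset in itertools.combinations(others, r):
                s = frozenset(subset)
                gain = score_fn(composite(s | {i})) - score_fn(composite(s))
                values[i] += weight * gain
    return values
```

For more than a few regions, a sampling estimator (as in the PixelSHAP sketch earlier) would replace the exhaustive loops.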