Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations
- URL: http://arxiv.org/abs/2509.25844v1
- Date: Tue, 30 Sep 2025 06:34:21 GMT
- Title: Believing without Seeing: Quality Scores for Contextualizing Vision-Language Model Explanations
- Authors: Keyu He, Tejas Srinivasan, Brihi Joshi, Xiang Ren, Jesse Thomason, Swabha Swayamdipta
- Abstract summary: We propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%.
- Score: 41.09442370052903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When people query Vision-Language Models (VLMs) but cannot see the accompanying visual context (e.g. for blind and low-vision users), augmenting VLM predictions with natural language explanations can signal which model predictions are reliable. However, prior work has found that explanations can easily convince users that inaccurate VLM predictions are correct. To remedy undesirable overreliance on VLM predictions, we propose evaluating two complementary qualities of VLM-generated explanations via two quality scoring functions. We propose Visual Fidelity, which captures how faithful an explanation is to the visual context, and Contrastiveness, which captures how well the explanation identifies visual details that distinguish the model's prediction from plausible alternatives. On the A-OKVQA and VizWiz tasks, these quality scoring functions are better calibrated with model correctness than existing explanation qualities. We conduct a user study in which participants have to decide whether a VLM prediction is accurate without viewing its visual context. We observe that showing our quality scores alongside VLM explanations improves participants' accuracy at predicting VLM correctness by 11.1%, including a 15.4% reduction in the rate of falsely believing incorrect predictions. These findings highlight the utility of explanation quality scores in fostering appropriate reliance on VLM predictions.
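As a generic illustration of what it can mean for a score to be "calibrated with model correctness", the sketch below computes a binned expected calibration error between per-example quality scores and whether the model's prediction was correct. This is a standard calibration measure, not necessarily the one used in the paper; the function name, the bin count, and the assumption that scores lie in [0, 1] are all illustrative choices rather than details from the abstract.

```python
from typing import Sequence


def expected_calibration_error(
    scores: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> float:
    """Size-weighted gap between mean score and empirical accuracy per bin.

    Assumes (hypothetically) that each score is a value in [0, 1] and that
    `correct[i]` says whether the model's prediction for example i was right.
    """
    if len(scores) != len(correct) or len(scores) == 0:
        raise ValueError("scores and correct must be non-empty and equal length")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores are assumed to lie in [0, 1]")

    # Assign each example to an equal-width bin; a score of 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for s, c in zip(scores, correct):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append((s, c))

    total = len(scores)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        ece += (len(members) / total) * abs(mean_score - accuracy)
    return ece


if __name__ == "__main__":
    # Toy data: high scores on correct predictions, low scores on incorrect ones.
    toy_scores = [0.9, 0.9, 0.1, 0.1]
    toy_correct = [True, True, False, False]
    print(expected_calibration_error(toy_scores, toy_correct))
```

A lower value means the scores track correctness more closely; a score that is high on incorrect predictions drives the value up.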
Related papers
- Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment [7.969076042774561]
We analyze the factors that cause contradictory assessments and instability. We introduce a two-stage tuning method that explicitly separates visual perception from quality inference.
arXiv Detail & Related papers (2025-12-10T11:50:42Z) - Explanation-Driven Counterfactual Testing for Faithfulness in Vision-Language Model Explanations [0.8657627742603715]
Vision-Language Models (VLMs) often produce fluent Natural Language Explanations (NLEs) that sound convincing but may not reflect causal factors driving predictions. This mismatch of plausibility and faithfulness poses technical and governance risks. We introduce Explanation-Driven Counterfactual Testing (EDCT), a fully automated verification procedure for a target VLM.
arXiv Detail & Related papers (2025-09-27T15:16:23Z) - Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training a VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z) - Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning [67.82016092549284]
We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system.
ReCoVERR tries to find relevant clues in an image that provide additional evidence for the prediction.
arXiv Detail & Related papers (2024-02-23T21:16:52Z) - Uncertainty-Aware Evaluation for Vision-Language Models [0.0]
Current evaluation methods overlook an essential component: uncertainty.
We show that models with the highest accuracy may also have the highest uncertainty.
Our empirical findings also reveal a correlation between model uncertainty and its language model part.
arXiv Detail & Related papers (2024-02-22T10:04:17Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can yield a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and increases of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity [45.86789047206224]
This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition.
Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity.
Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data.
arXiv Detail & Related papers (2023-06-28T09:29:06Z) - VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z) - The Unreliability of Explanations in Few-Shot In-Context Learning [50.77996380021221]
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans--those that are logically consistent with the input--usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.