The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
- URL: http://arxiv.org/abs/2509.13379v1
- Date: Tue, 16 Sep 2025 08:17:39 GMT
- Title: The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
- Authors: Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Pervez
- Abstract summary: Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. We conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs across 6 multimodal datasets with 3 distinct scoring functions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
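Since the abstract names the framework but not its mechanics, here is a minimal sketch of split conformal prediction over multiple-choice VLM outputs. The LAC-style score 1 - p_y(x) is an assumption (the paper's three scoring functions are not named above), and all probabilities below are synthetic stand-ins for real model outputs.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with a LAC-style score s(x, y) = 1 - p_y(x).

    cal_probs:  (n, K) softmax probabilities on a held-out calibration set
    cal_labels: (n,)   indices of the true answers
    test_probs: (m, K) softmax probabilities for the test questions
    Returns an (m, K) boolean mask of prediction-set membership.
    """
    n = len(cal_labels)
    # Nonconformity score of each calibration example's true answer.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile for >= (1 - alpha) marginal coverage.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, q_level, method="higher")
    # Include every option whose score falls below the threshold.
    return (1.0 - test_probs) <= qhat

# Toy usage: 4-option multiple choice, synthetic probabilities.
rng = np.random.default_rng(0)
cal_p = rng.dirichlet(np.ones(4), size=500)
cal_y = rng.integers(0, 4, size=500)
test_p = rng.dirichlet(np.ones(4), size=8)
sets = conformal_sets(cal_p, cal_y, test_p)
print(sets.sum(axis=1))  # set sizes; smaller means more certain
```

At a fixed coverage level 1 - alpha, the average prediction-set size is the usual figure of merit: a model with sharper uncertainty emits smaller sets, which is the sense in which the larger models above "know better what they don't know."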
Related papers
- Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness [12.513874407270142]
We present a systematic comparison of "thinking" and "non-thinking" Large Language Models (LLMs). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks. Our results show that thinking models achieve roughly 10 percentage points higher accuracy with little overhead.
arXiv Detail & Related papers (2025-09-09T18:36:02Z)
- Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z)
- Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation [51.19622266249408]
MultiTrust-X is a benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets. Our experiments reveal significant vulnerabilities in current models.
arXiv Detail & Related papers (2025-08-21T09:00:01Z)
- Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, the first comprehensive benchmark of its kind, evaluating 23 state-of-the-art videoLLMs. Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience, and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z)
- Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models [15.158475816860427]
Uncertainty is essential for assessing the reliability and trustworthiness of modern AI systems. Verbalized uncertainty, where models express their confidence through natural language, has emerged as a lightweight and interpretable solution. However, its effectiveness in vision-language models (VLMs) remains insufficiently studied.
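For illustration, a common way to score verbalized confidence is expected calibration error (ECE) over parsed confidence statements. The prompt format and the 10-bin ECE below are conventional choices assumed here, not details from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin answers by stated confidence, then compare each
    bin's average confidence against its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by bin occupancy
    return ece

# Hypothetical parsed outputs: the model was asked to append
# "Confidence: NN%" to each answer, and NN was extracted by regex.
stated = [0.95, 0.60, 0.80, 0.99, 0.70]  # verbalized confidences
right  = [1,    0,    1,    1,    0]     # whether each answer was correct
print(f"ECE = {expected_calibration_error(stated, right):.3f}")
```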
arXiv Detail & Related papers (2025-05-26T17:16:36Z)
- TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning [27.449948943467163]
Large Language Models (LLMs) have demonstrated impressive capabilities, but their output quality remains inconsistent. We propose a Token-level Uncertainty estimation framework for Reasoning (TokUR). Our approach consistently outperforms existing uncertainty estimation methods.
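TokUR's exact formulation is not given in this snippet; as a generic baseline (an assumption, not the paper's method), token-level uncertainty is often read off the decoder's per-token distributions:

```python
import torch

def token_level_uncertainty(logits: torch.Tensor) -> dict:
    """Generic token-level uncertainty signals from decoder logits.

    logits: (seq_len, vocab) scores at each generated position.
    Returns mean per-token entropy and mean negative log-probability
    of the tokens actually chosen (greedy here, for simplicity).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)            # (seq_len,)
    chosen = log_probs.argmax(dim=-1)                     # greedy tokens
    chosen_lp = log_probs.gather(-1, chosen.unsqueeze(-1)).squeeze(-1)
    return {"mean_entropy": entropy.mean().item(),
            "mean_nll": (-chosen_lp).mean().item()}

# Toy usage: random logits standing in for a real decoder's output.
print(token_level_uncertainty(torch.randn(12, 32000)))
```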
arXiv Detail & Related papers (2025-05-16T22:47:32Z)
- A Critical Synthesis of Uncertainty Quantification and Foundation Models in Monocular Depth Estimation [13.062551984263031]
Metric depth estimation, which involves predicting absolute distances, poses particular challenges. We fuse five different uncertainty quantification methods with the current state-of-the-art DepthAnythingV2 foundation model. Our findings identify fine-tuning with the Gaussian Negative Log-Likelihood Loss (GNLL) as a particularly promising approach.
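As a sketch of what GNLL fine-tuning can look like, PyTorch ships the named objective as nn.GaussianNLLLoss; the two-channel mean/variance head and tensor shapes below are illustrative assumptions, not the paper's DepthAnythingV2 setup.

```python
import torch
import torch.nn as nn

# A depth head predicting per-pixel mean and variance; this two-channel
# design is an assumption for illustration only.
class DepthWithUncertainty(nn.Module):
    def __init__(self, in_ch: int = 64):
        super().__init__()
        self.head = nn.Conv2d(in_ch, 2, kernel_size=1)  # [mean, log_var]

    def forward(self, feats):
        mean, log_var = self.head(feats).chunk(2, dim=1)
        return mean, log_var.exp()  # variance must be positive

criterion = nn.GaussianNLLLoss()          # the GNLL objective named above
model = DepthWithUncertainty()
feats = torch.randn(2, 64, 32, 32)        # stand-in backbone features
target = torch.rand(2, 1, 32, 32) * 10.0  # synthetic metric depths (m)
mean, var = model(feats)
loss = criterion(mean, target, var)       # -log N(target; mean, var)
loss.backward()
```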
arXiv Detail & Related papers (2025-01-14T15:13:00Z)
- AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs [70.4578433679737]
We introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension.
arXiv Detail & Related papers (2025-01-03T23:03:24Z)
- UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
We introduce UBench, a new benchmark for evaluating the uncertainty of large language models (LLMs). Unlike other benchmarks, UBench is based on confidence intervals. It encompasses 11,978 multiple-choice questions spanning knowledge, language, understanding, and reasoning capabilities. Our analysis reveals several crucial insights: 1) our confidence interval-based methods are highly effective for uncertainty quantification; 2) regarding uncertainty, outstanding open-source models show competitive performance versus closed-source models; 3) CoT and RP prompts present potential ways to improve model reliability, while the influence of temperature changes follows no universal rule.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
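A minimal sketch of that idea, assuming answers are sampled alongside explanations at nonzero temperature (the paper's stability analysis over explanations is omitted here):

```python
from collections import Counter
import math

def explanation_based_confidence(samples):
    """Confidence from the answer distribution across sampled explanations.

    samples: list of (explanation, answer) pairs drawn from the model.
    Returns the majority answer, its empirical probability, and the
    entropy of the answer distribution (higher entropy = less confident).
    """
    counts = Counter(answer for _, answer in samples)
    total = sum(counts.values())
    top_answer, top_count = counts.most_common(1)[0]
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return top_answer, top_count / total, entropy

# Hypothetical samples: five chains of thought for one question.
chains = [("short derivation ...", "B"), ("algebra ...", "B"),
          ("guessing ...", "C"), ("units check ...", "B"),
          ("re-derived ...", "B")]
print(explanation_based_confidence(chains))  # ('B', 0.8, ...)
```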
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Uncertainty-Aware Evaluation for Vision-Language Models [0.0]
Current evaluation methods overlook an essential component: uncertainty.
We show that models with the highest accuracy may also have the highest uncertainty.
Our empirical findings also reveal a correlation between a model's uncertainty and its underlying language-model component.
arXiv Detail & Related papers (2024-02-22T10:04:17Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
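The three steps above map directly onto a small ensemble loop; clarify_fn and answer_fn below are hypothetical stand-ins for the actual LLM calls, not the paper's implementation.

```python
from collections import Counter

def clarification_ensemble(question, clarify_fn, answer_fn, k=5):
    """Input clarification ensembling, per the description above (sketch).

    clarify_fn: question -> list of disambiguated rewrites (e.g. from an
                LLM prompted to resolve ambiguities); assumed helper.
    answer_fn:  clarified question -> answer string; assumed helper.

    Agreement across clarifications suggests the model itself is the
    source of any remaining uncertainty; disagreement attributes the
    uncertainty to the ambiguous input instead.
    """
    clarified = clarify_fn(question)[:k]
    answers = [answer_fn(c) for c in clarified]
    votes = Counter(answers)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(answers), votes

# Toy usage with stub helpers standing in for real LLM calls.
stub_clarify = lambda q: [f"{q} (reading {i})" for i in range(5)]
stub_answer = lambda q: "42" if "reading 3" not in q else "41"
print(clarification_ensemble("ambiguous question?", stub_clarify, stub_answer))
```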
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.