Logit-Level Uncertainty Quantification in Vision-Language Models for Histopathology Image Analysis
- URL: http://arxiv.org/abs/2603.03527v1
- Date: Tue, 03 Mar 2026 21:21:00 GMT
- Title: Logit-Level Uncertainty Quantification in Vision-Language Models for Histopathology Image Analysis
- Authors: Betul Yurdem, Ferhat Ozgur Catak, Murat Kuzlu, Mehmet Kemal Gullu
- Abstract summary: Vision-Language Models (VLMs) with their multimodal capabilities have demonstrated remarkable success in almost all domains. This study proposes a logit-level uncertainty quantification framework for histopathology image analysis using VLMs.
- Score: 0.5879782260984691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs), with their multimodal capabilities, have demonstrated remarkable success in almost all domains, including education, transportation, healthcare, energy, finance, law, and retail. Nevertheless, the use of VLMs in healthcare applications raises crucial concerns due to the sensitivity of large-scale medical data and the trustworthiness of these models (reliability, transparency, and security). This study proposes a logit-level uncertainty quantification (UQ) framework for histopathology image analysis using VLMs to address these concerns. UQ is evaluated for three VLMs using metrics derived from temperature-controlled output logits. The proposed framework reveals a critical separation in uncertainty behavior. The general-purpose VLMs show high stochastic sensitivity (cosine similarity (CS) $<0.71$ and $<0.84$, Jensen-Shannon divergence (JS) $<0.57$ and $<0.38$, and Kullback-Leibler divergence (KL) $<0.55$ and $<0.35$, respectively, for mean values of VILA-M3-8B and LLaVA-Med v1.5), near-maximal temperature impacts ($\Delta_T \approx 1.00$), and abrupt uncertainty transitions, particularly for complex diagnostic prompts. In contrast, the pathology-specific PRISM model maintains near-deterministic behavior (mean CS $>0.90$, JS $<0.10$, KL $<0.09$) and minimal temperature effects across all prompt complexities. These findings emphasize the importance of logit-level uncertainty quantification for evaluating trustworthiness in histopathology applications that utilize VLMs.
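The temperature-based metrics the abstract reports can be illustrated with a short sketch. Assuming access to a model's raw output logits, the snippet below compares the softmax distributions produced at two sampling temperatures using the three measures named above (cosine similarity, JS divergence, KL divergence); the example logits are hypothetical, and this is a minimal illustration of the metrics, not the authors' implementation.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a logit vector."""
    z = logits / temperature
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cosine_similarity(p, q):
    """Cosine similarity between two probability vectors (1.0 = identical direction)."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0)."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence, bounded by ln(2) in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical output logits for one prompt; compare two temperatures.
logits = np.array([2.0, 1.0, 0.5, -1.0])
p = softmax(logits, temperature=0.5)
q = softmax(logits, temperature=1.5)
print(cosine_similarity(p, q), js_divergence(p, q), kl_divergence(p, q))
```

A low CS and high JS/KL between the two distributions would indicate the kind of stochastic sensitivity the paper observes in the general-purpose VLMs.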
Related papers
- Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z) - From Global to Granular: Revealing IQA Model Performance via Correlation Surface [83.65597122328133]
We present Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models.
arXiv Detail & Related papers (2026-01-29T13:55:26Z) - PUNCH: Physics-informed Uncertainty-aware Network for Coronary Hemodynamics [8.812266680285369]
We introduce an uncertainty-aware framework for estimating coronary flow reserve directly from standard angiography. The system integrates physics-informed neural networks with variational inference to infer coronary blood flow from first-principles models of contrast transport. The pipeline runs in approximately three minutes per patient on a single GPU, with no population-level training.
arXiv Detail & Related papers (2026-01-23T21:47:23Z) - VSF-Med:A Vulnerability Scoring Framework for Medical Vision-Language Models [6.390468088226493]
We introduce VSF-Med, an end-to-end vulnerability-scoring framework for medical Vision-Language Models (VLMs). VSF-Med synthesizes over 30,000 adversarial variants from 5,000 radiology images and enables reproducible benchmarking of any medical VLM with a single command. We show that Llama-3.2-11B-Vision-Instruct exhibits a peak vulnerability increase of $1.29\sigma$ for persistence-of-attack-effects, while GPT-4o shows increases of $0.69\sigma$ for that same vector and $0.28\sigma$ for prompt-injection attacks.
arXiv Detail & Related papers (2025-06-25T02:56:38Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks [2.033441577169909]
Vision-Language Models (VLMs) have great potential in medical tasks, like Visual Question Answering (VQA). Their robustness to distribution shifts on unseen data remains a key concern for safe deployment. We introduce a novel framework, called SURE-VQA, centered around three key requirements to overcome current pitfalls.
arXiv Detail & Related papers (2024-11-29T13:22:52Z) - Improving Medical Diagnostics with Vision-Language Models: Convex Hull-Based Uncertainty Analysis [0.3277163122167434]
This paper proposes a novel approach to evaluating uncertainty in vision-language models (VLMs) using a convex hull approach in a healthcare application for Visual Question Answering (VQA). According to the results, the LLM-CXR VLM shows high uncertainty at higher temperature settings.
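As a rough illustration of a convex-hull-style dispersion measure, the sketch below computes the area of the 2-D convex hull spanned by projected response embeddings: a larger hull suggests more dispersed, hence more uncertain, outputs. The points and the 2-D projection are hypothetical assumptions for illustration; this is not the paper's actual method.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone-chain convex hull for 2-D points (counter-clockwise)."""
    pts = sorted(map(tuple, points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:                       # build lower hull left-to-right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull right-to-left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # drop duplicated endpoints

def hull_area(points):
    """Shoelace area of the convex hull; larger area -> more dispersed outputs."""
    h = convex_hull(points)
    x = np.array([p[0] for p in h])
    y = np.array([p[1] for p in h])
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

# Hypothetical 2-D projections of response embeddings at one temperature.
pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
print(hull_area(pts))  # -> 1.0 (the interior point is ignored)
```

Comparing hull areas across temperature settings would then show whether dispersion grows with temperature, the trend reported for LLM-CXR.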
arXiv Detail & Related papers (2024-11-24T17:49:48Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - Machine Learning for ALSFRS-R Score Prediction: Making Sense of the Sensor Data [44.99833362998488]
Amyotrophic Lateral Sclerosis (ALS) is a rapidly progressive neurodegenerative disease that presents individuals with limited treatment options.
The present investigation, spearheaded by the iDPP@CLEF 2024 challenge, focuses on utilizing sensor-derived data obtained through an app.
arXiv Detail & Related papers (2024-07-10T19:17:23Z) - Localizing Anomalies via Multiscale Score Matching Analysis [13.898576482792173]
This paper introduces Spatial-MSMA, a novel unsupervised method for anomaly localization in brain MRIs.
We employ a flexible normalizing flow model conditioned on patch positions and global image features to estimate patch-wise anomaly scores.
The method is evaluated on a dataset of 1,650 T1- and T2-weighted brain MRIs from typically developing children.
arXiv Detail & Related papers (2024-06-28T17:57:12Z) - Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation.
It is crucial to correctly quantify their uncertainty in responding to given inputs.
We develop a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs.
arXiv Detail & Related papers (2024-04-04T02:31:05Z) - Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP).
We use transition observations from the original MDP, whether they are generated under the same or a different policy.
Our estimator also permits statistical inference using Wald confidence intervals.
arXiv Detail & Related papers (2024-03-29T18:11:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.