Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form
Medical Question Answering Applications and Beyond
- URL: http://arxiv.org/abs/2402.14259v1
- Date: Thu, 22 Feb 2024 03:46:08 GMT
- Title: Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form
Medical Question Answering Applications and Beyond
- Authors: Zhiyuan Wang, Jinhao Duan, Chenxi Yuan, Qingyu Chen, Tianlong Chen,
Huaxiu Yao, Yue Zhang, Ren Wang, Kaidi Xu, Xiaoshuang Shi
- Abstract summary: Uncertainty estimation plays a pivotal role in ensuring the reliability of safety-critical human-AI interaction systems.
We propose the Word-Sequence Entropy (WSE), which calibrates the uncertainty proportion at both the word and sequence levels according to semantic relevance.
We show that WSE exhibits superior performance on accurate uncertainty measurement under two standard criteria for correctness evaluation.
- Score: 63.969531254692725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Uncertainty estimation plays a pivotal role in ensuring the reliability of
safety-critical human-AI interaction systems, particularly in the medical
domain. However, a general method for quantifying the uncertainty of free-form
answers has yet to be established in open-ended medical question-answering (QA)
tasks, where irrelevant words and sequences with limited semantic information
can be the primary source of uncertainty due to the presence of generative
inequality. In this paper, we propose the Word-Sequence Entropy (WSE), which
calibrates the uncertainty proportion at both the word and sequence levels
according to semantic relevance, with greater emphasis placed on keywords
and more relevant sequences when performing uncertainty quantification. We
compare WSE with 6 baseline methods on 5 free-form medical QA datasets,
utilizing 7 "off-the-shelf" large language models (LLMs), and show that WSE
exhibits superior performance on accurate uncertainty measurement under two
standard criteria for correctness evaluation (e.g., WSE outperforms the existing
state-of-the-art method by 3.23% AUROC on the MedQA dataset). Additionally, in
terms of the potential for real-world medical QA applications, we achieve a
significant enhancement in the performance of LLMs when employing sequences
with lower uncertainty, identified by WSE, as final answers (e.g., +6.36%
accuracy improvement on the COVID-QA dataset), without requiring any additional
task-specific fine-tuning or architectural modifications.
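The abstract's core idea, weighting token-level uncertainty by word relevance and then weighting whole sequences by their semantic relevance, can be sketched as below. The function name, input format, and aggregation scheme are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def word_sequence_entropy(token_logprobs, word_relevance, seq_relevance):
    """Toy WSE-style score: relevance-weighted uncertainty over words, then sequences.

    token_logprobs: list of 1-D arrays, log-probability of each generated token
                    in each sampled answer
    word_relevance: list of 1-D arrays in (0, 1], semantic relevance of each word
    seq_relevance:  1-D array in (0, 1], relevance weight of each sampled answer
    """
    seq_scores = []
    for lp, w in zip(token_logprobs, word_relevance):
        w = w / w.sum()                       # emphasise keywords within an answer
        seq_scores.append(-(w * lp).sum())    # relevance-weighted negative log-likelihood
    seq_scores = np.asarray(seq_scores)
    sw = seq_relevance / seq_relevance.sum()  # emphasise more relevant sequences
    return float((sw * seq_scores).sum())     # lower score = lower estimated uncertainty
```

In this sketch, an answer whose high-relevance words carry low model confidence receives a higher uncertainty score, while irrelevant words and low-relevance sequences contribute less, mirroring the "generative inequality" intuition described in the abstract.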
Related papers
- VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models [12.180198973471645]
We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in vision-language models (VLMs). It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations.
arXiv Detail & Related papers (2026-02-09T21:37:09Z)
- Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering [7.1559850008795385]
Large Language Models (LLMs) are commonly used in Question Answering (QA) settings. Existing UQ approaches remain weakly validated in scientific QA. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA.
arXiv Detail & Related papers (2026-01-30T20:02:34Z)
- Mind the Ambiguity: Aleatoric Uncertainty Quantification in LLMs for Safe Medical Question Answering [6.782185804809171]
Large Language Models in Medical Question Answering are severely hampered by ambiguous user queries. In this paper, we formalize this challenge by linking input ambiguity to aleatoric uncertainty (AU), which is the irreducible uncertainty arising from underspecified input. We introduce a novel AU-guided "Clarify-Before-Answer" framework, which incorporates AU-Probe, a lightweight module that detects input ambiguity directly from hidden states.
arXiv Detail & Related papers (2026-01-24T03:44:08Z)
- Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering [29.4458902836278]
We introduce a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. We derive an upper bound for uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations. We apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance, context comprehension, and honesty.
arXiv Detail & Related papers (2025-10-03T02:09:25Z)
- Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models [5.6672926445919165]
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods often lack a probabilistic foundation. We propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations.
arXiv Detail & Related papers (2025-06-11T13:02:17Z)
- Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey [11.737403011836532]
Large Language Models (LLMs) excel in text generation, reasoning, and decision-making in high-stakes domains such as healthcare, law, and transportation.
Uncertainty quantification (UQ) enhances trustworthiness by estimating confidence in outputs, enabling risk mitigation and selective prediction.
We introduce a new taxonomy that categorizes UQ methods based on computational efficiency and uncertainty dimensions.
arXiv Detail & Related papers (2025-03-20T05:04:29Z)
- Statistical Guarantees of Correctness Coverage for Medical Multiple-Choice Question Answering [0.0]
Large language models (LLMs) are increasingly deployed in real-world question-answering (QA) applications.
LLMs have been proven to generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks.
In this work, we adapt the conformal prediction (CP) framework to medical multiple-choice question-answering (MCQA) tasks for the first time.
arXiv Detail & Related papers (2025-03-07T15:22:10Z)
- Legitimate ground-truth-free metrics for deep uncertainty classification scoring [3.9599054392856483]
The use of Uncertainty Quantification (UQ) methods in production remains limited.
This limitation is exacerbated by the challenge of validating UQ methods in absence of UQ ground truth.
This paper investigates such metrics and proves that they are theoretically well-behaved and actually tied to some uncertainty ground truth.
arXiv Detail & Related papers (2024-10-30T14:14:32Z)
- Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness [106.52630978891054]
We present a taxonomy of uncertainty specific to vision-language AI systems.
We also introduce a new metric confidence-weighted accuracy, that is well correlated with both accuracy and calibration error.
arXiv Detail & Related papers (2024-07-02T04:23:54Z)
- Uncertainty Quantification in Table Structure Recognition [6.328777177761948]
This paper proposes a method for uncertainty quantification (UQ) of table structure recognition (TSR).
Our key idea is to enrich and diversify the table representations, to spotlight the cells with high recognition uncertainties.
Cell complexity quantification gauges the uncertainty of each cell by its topological relation with neighboring cells.
arXiv Detail & Related papers (2024-07-01T19:03:55Z)
- ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory.
We then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm.
Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z)
- Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities [79.9629927171974]
Uncertainty in Large Language Models (LLMs) is crucial for applications where safety and reliability are important.
We propose Kernel Language Entropy (KLE), a novel method for uncertainty estimation in white- and black-box LLMs.
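The KLE blurb above can be illustrated with a minimal sketch: treat a normalised semantic-similarity kernel over sampled answers as a unit-trace density matrix and take its von Neumann entropy. The function name and the kernel construction here are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def kernel_language_entropy(similarity):
    """Toy KLE-style uncertainty from a semantic-similarity kernel.

    similarity: symmetric positive semi-definite matrix of pairwise semantic
    similarities between sampled answers (e.g., from an embedding model).
    """
    K = similarity / np.trace(similarity)     # normalise to a unit-trace density matrix
    eig = np.linalg.eigvalsh(K)
    eig = eig[eig > 1e-12]                    # drop numerical zeros before the log
    return float(-(eig * np.log(eig)).sum())  # von Neumann entropy of the kernel
```

Answers that are all semantically equivalent yield a rank-one kernel and entropy near zero, while mutually dissimilar answers spread the spectrum and raise the entropy, giving a finer-grained signal than counting exact-match clusters.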
arXiv Detail & Related papers (2024-05-30T12:42:05Z)
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
- Towards Clear Expectations for Uncertainty Estimation [64.20262246029286]
Uncertainty Quantification (UQ) is crucial for achieving trustworthy Machine Learning (ML).
Most UQ methods suffer from disparate and inconsistent evaluation protocols.
This opinion paper offers a new perspective by specifying those requirements through five downstream tasks.
arXiv Detail & Related papers (2022-07-27T07:50:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.