Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form
Medical Question Answering Applications and Beyond
- URL: http://arxiv.org/abs/2402.14259v1
- Date: Thu, 22 Feb 2024 03:46:08 GMT
- Title: Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form
Medical Question Answering Applications and Beyond
- Authors: Zhiyuan Wang, Jinhao Duan, Chenxi Yuan, Qingyu Chen, Tianlong Chen,
Huaxiu Yao, Yue Zhang, Ren Wang, Kaidi Xu, Xiaoshuang Shi
- Abstract summary: Uncertainty estimation plays a pivotal role in ensuring the reliability of safety-critical human-AI interaction systems.
We propose the Word-Sequence Entropy (WSE), which calibrates the uncertainty proportion at both the word and sequence levels according to semantic relevance.
We show that WSE exhibits superior performance in accurate uncertainty measurement under two standard criteria for correctness evaluation.
- Score: 63.969531254692725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Uncertainty estimation plays a pivotal role in ensuring the reliability of
safety-critical human-AI interaction systems, particularly in the medical
domain. However, a general method for quantifying the uncertainty of free-form
answers has yet to be established in open-ended medical question-answering (QA)
tasks, where irrelevant words and sequences with limited semantic information
can be the primary source of uncertainty due to the presence of generative
inequality. In this paper, we propose the Word-Sequence Entropy (WSE), which
calibrates the uncertainty proportion at both the word and sequence levels
according to semantic relevance, with greater emphasis placed on keywords
and more relevant sequences when performing uncertainty quantification. We
compare WSE with 6 baseline methods on 5 free-form medical QA datasets,
utilizing 7 "off-the-shelf" large language models (LLMs), and show that WSE
exhibits superior performance in accurate uncertainty measurement under two
standard criteria for correctness evaluation (e.g., WSE outperforms the
existing state-of-the-art method by 3.23% AUROC on the MedQA dataset). Additionally, in
terms of the potential for real-world medical QA applications, we achieve a
significant enhancement in the performance of LLMs when employing sequences
with lower uncertainty, identified by WSE, as final answers (e.g., +6.36%
accuracy improvement on the COVID-QA dataset), without requiring any additional
task-specific fine-tuning or architectural modifications.
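
The abstract describes WSE only at a high level. The sketch below is a minimal, illustrative reading of that description, assuming per-token log-probabilities for each sampled answer and a sentence-similarity function `sim`; the names (`token_weights`, `word_sequence_entropy`, `select_answer`) and the ablation-based relevance proxy are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch of the Word-Sequence Entropy (WSE) idea, NOT the authors'
# reference implementation. Assumptions: each candidate answer comes with
# per-token log-probabilities from an "off-the-shelf" LLM, and sim(a, b)
# returns a semantic similarity score in (0, 1].
from typing import Callable, List, Tuple

Candidate = Tuple[List[str], List[float]]  # (tokens, per-token log-probs)
Sim = Callable[[str, str], float]

def token_weights(tokens: List[str], question: str, sim: Sim) -> List[float]:
    """Word-level calibration: weight each token by the similarity drop
    caused by removing it -- a simple proxy for semantic relevance, so
    keywords receive a larger share of the uncertainty."""
    full = " ".join(tokens)
    base = sim(full, question)
    raw = []
    for i in range(len(tokens)):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        raw.append(max(base - sim(ablated, question), 0.0) + 1e-6)
    total = sum(raw)
    return [r / total for r in raw]

def sequence_entropy(cand: Candidate, question: str, sim: Sim) -> float:
    """Entropy of one sequence with keyword-emphasizing token weights."""
    tokens, logps = cand
    weights = token_weights(tokens, question, sim)
    return -sum(w * lp for w, lp in zip(weights, logps))

def word_sequence_entropy(cands: List[Candidate], question: str,
                          sim: Sim) -> float:
    """Sequence-level calibration: average per-sequence entropies,
    weighting sequences by their relevance to the question."""
    rel = [sim(" ".join(tokens), question) for tokens, _ in cands]
    z = sum(rel)
    ents = [sequence_entropy(c, question, sim) for c in cands]
    return sum((r / z) * e for r, e in zip(rel, ents))

def select_answer(cands: List[Candidate], question: str, sim: Sim) -> List[str]:
    """Answer selection: return the sampled sequence with the lowest
    relevance-weighted entropy as the final answer."""
    return min(cands, key=lambda c: sequence_entropy(c, question, sim))[0]
```

Here `select_answer` mirrors the application setting reported above: returning the lowest-uncertainty sampled sequence as the final answer is the kind of selection behind the reported accuracy gains (e.g., +6.36% on COVID-QA), with no fine-tuning or architectural changes.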
Related papers
- Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness [106.52630978891054]
We present a taxonomy of uncertainty specific to vision-language AI systems.
We also introduce a new metric, confidence-weighted accuracy, which is well correlated with both accuracy and calibration error.
arXiv Detail & Related papers (2024-07-02T04:23:54Z) - ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory.
We then develop a conformal uncertainty criterion by integrating an uncertainty condition aligned with correctness into the conformal prediction (CP) algorithm.
Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods (a minimal split-conformal sketch of the coverage mechanism appears after this list).
arXiv Detail & Related papers (2024-06-29T17:33:07Z) - Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z) - Evaluating AI systems under uncertain ground truth: a case study in
dermatology [44.80772162289557]
We propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation.
We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses.
arXiv Detail & Related papers (2023-07-05T10:33:45Z) - Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data.
arXiv Detail & Related papers (2023-01-01T05:02:46Z) - Modeling Disagreement in Automatic Data Labelling for Semi-Supervised
Learning in Clinical Natural Language Processing [2.016042047576802]
We investigate the quality of uncertainty estimates from a range of current state-of-the-art predictive models applied to the problem of observation detection in radiology reports.
arXiv Detail & Related papers (2022-05-29T20:20:49Z) - Bayesian autoencoders with uncertainty quantification: Towards
trustworthy anomaly detection [78.24964622317634]
In this work, the formulation of Bayesian autoencoders (BAEs) is adopted to quantify the total anomaly uncertainty.
To evaluate the quality of uncertainty, we consider the task of classifying anomalies with the additional option of rejecting predictions of high uncertainty.
Our experiments demonstrate the effectiveness of the BAE and total anomaly uncertainty on a set of benchmark datasets and two real datasets for manufacturing.
arXiv Detail & Related papers (2022-02-25T12:20:04Z) - Distribution-Free Federated Learning with Conformal Predictions [0.0]
Federated learning aims to leverage separate institutional datasets while maintaining patient privacy.
Poor calibration and lack of interpretability may hamper widespread deployment of federated models into clinical practice.
We propose to address these challenges by incorporating an adaptive conformal framework into federated learning.
arXiv Detail & Related papers (2021-10-14T18:41:17Z) - Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain
Management [5.044336341666555]
We introduce Q-Pain, a dataset for assessing bias in medical QA in the context of pain management.
We propose a new, rigorous framework, including a sample experimental design, to measure the potential biases present when making treatment decisions.
arXiv Detail & Related papers (2021-08-03T21:55:28Z)