Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration
- URL: http://arxiv.org/abs/2409.03225v1
- Date: Thu, 5 Sep 2024 03:45:35 GMT
- Title: Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration
- Authors: Jeremy Qin, Bang Liu, Quoc Dinh Nguyen
- Abstract summary: Black-box large language models (LLMs) are increasingly deployed in various environments.
LLMs often exhibit overconfidence, leading to potential risks and misjudgments.
We propose a novel method, *Atypical Presentations Recalibration*, which leverages atypical presentations to adjust the model's confidence estimates.
- Score: 20.049443396032423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to effectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence, leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, *Atypical Presentations Recalibration*, which leverages atypical presentations to adjust the model's confidence estimates. Our approach significantly improves calibration, reducing calibration errors by approximately 60% on three medical question answering datasets and outperforming existing methods such as vanilla verbalized confidence, CoT verbalized confidence, and others. Additionally, we provide an in-depth analysis of the role of atypicality within the recalibration framework.
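The abstract gives the idea but not the mechanics: elicit a verbalized confidence, ask the model how atypical the clinical presentation is, and shrink the confidence accordingly. A minimal sketch of that loop follows; the `llm` callable, the prompt wording, the extraction regex, and the linear shrinkage rule with its `penalty` parameter are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of atypicality-aware confidence recalibration for a
# black-box LLM. All prompts and the adjustment rule are assumptions
# made for illustration; they are not taken from the paper.
from typing import Callable
import re

def verbalized_confidence(llm: Callable[[str], str], question: str) -> tuple[str, float]:
    """Elicit an answer plus a self-reported confidence in [0, 1]."""
    reply = llm(
        f"{question}\n"
        "Answer the question, then state your confidence as a number "
        "between 0 and 1 on a line starting with 'Confidence:'."
    )
    match = re.search(r"Confidence:\s*([01](?:\.\d+)?)", reply)
    confidence = float(match.group(1)) if match else 0.5
    answer = reply.split("Confidence:")[0].strip()
    return answer, confidence

def atypicality_score(llm: Callable[[str], str], question: str) -> float:
    """Ask how atypical the presentation is: 0 = textbook, 1 = highly atypical."""
    reply = llm(
        f"{question}\n"
        "On a scale from 0 (textbook presentation) to 1 (highly atypical), "
        "how atypical is this clinical presentation? Reply with a number only."
    )
    try:
        return min(max(float(reply.strip()), 0.0), 1.0)
    except ValueError:
        return 0.0  # unparseable reply: fall back to treating the case as typical

def recalibrate(confidence: float, atypicality: float, penalty: float = 0.5) -> float:
    """Shrink confidence toward 0.5 in proportion to atypicality (illustrative rule)."""
    return confidence - penalty * atypicality * (confidence - 0.5)
```

Under this toy rule, an answer held at 0.9 confidence on a fully atypical presentation would be reported at 0.7, while textbook presentations pass through unchanged, concentrating the correction on the atypical cases that the abstract identifies as the lever.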
Related papers
- Fact-Level Confidence Calibration and Self-Correction [64.40105513819272]
We propose a Fact-Level framework that calibrates confidence to relevance-weighted correctness at the fact level.
We also develop Confidence-Guided Fact-level Self-Correction (**ConFix**), which uses high-confidence facts within a response as additional knowledge to improve low-confidence ones.
arXiv Detail & Related papers (2024-11-20T14:15:18Z)
- On Calibration of LLM-based Guard Models for Reliable Content Moderation [27.611237252584402]
Large language models (LLMs) pose significant risks, both from generating harmful content and from users attempting to evade guardrails.
Existing studies have developed LLM-based guard models designed to moderate the inputs and outputs of LLMs.
However, limited attention has been given to the reliability and calibration of such guard models.
arXiv Detail & Related papers (2024-10-14T12:04:06Z)
- Confidence Estimation for LLM-Based Dialogue State Tracking [9.305763502526833]
Estimation of a model's confidence in its outputs is critical for Conversational AI systems based on large language models (LLMs).
We provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs.
Our findings suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
arXiv Detail & Related papers (2024-09-15T06:44:26Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation [18.815226646364476]
Existing calibration methods for large language models (LLMs) focus on estimating or eliciting individual confidence without taking full advantage of the "Collective Wisdom".
We propose Collaborative Calibration, a post-hoc, training-free calibration strategy that leverages the collaborative and expressive capabilities of multiple tool-augmented LLM agents in a simulated group deliberation process.
arXiv Detail & Related papers (2024-04-14T02:40:43Z)
- Calibrating Large Language Models with Sample Consistency [76.23956851098598]
We explore the potential of deriving confidence from the distribution of multiple randomly sampled model generations, via three measures of consistency.
Results show that consistency-based calibration methods outperform existing post-hoc approaches.
We offer practical guidance on choosing suitable consistency metrics for calibration, tailored to the characteristics of various LMs; a minimal sketch of consistency-based confidence appears after this list.
arXiv Detail & Related papers (2024-02-21T16:15:20Z)
- A Study on the Calibration of In-context Learning [27.533223818505682]
We study in-context learning (ICL), a prevalent method for adapting static language models through tailored prompts.
We observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration.
We explore recalibration techniques and find that a scaling-binning calibrator can reduce calibration errors consistently.
arXiv Detail & Related papers (2023-12-07T03:37:39Z)
- On the Calibration of Large Language Models and Alignment [63.605099174744865]
Confidence calibration serves as a crucial tool for gauging the reliability of deep models.
We conduct a systematic examination of the calibration of aligned language models throughout the entire construction process.
Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
arXiv Detail & Related papers (2023-11-22T08:57:55Z)
- Calibrating Multimodal Learning [94.65232214643436]
We propose a novel regularization technique, Calibrating Multimodal Learning (CML), to calibrate the predictive confidence of previous methods.
The technique can be flexibly incorporated into existing models, improving confidence calibration, classification accuracy, and model robustness.
arXiv Detail & Related papers (2023-06-02T04:29:57Z)
- Calibration of Neural Networks [77.34726150561087]
This paper presents a survey of confidence calibration problems in the context of neural networks.
We analyze the problem statement, calibration definitions, and different approaches to evaluation.
Empirical experiments cover various datasets and models, comparing calibration methods according to different criteria.
arXiv Detail & Related papers (2023-03-19T20:27:51Z)
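Two mechanics recur across the entries above: deriving confidence from agreement among repeated samples, as in the sample-consistency paper, and scoring calibration with a binned expected calibration error (ECE), the kind of metric behind claims like the 60% error reduction. A minimal, self-contained sketch of both follows; the `sample_answer` stub standing in for temperature-sampled LLM generations, the sample count, and the bin count are assumptions for illustration.

```python
# Sketch: consistency-based confidence from repeated sampling, plus the
# standard binned expected calibration error (ECE) used to evaluate it.
# `sample_answer` stands in for a temperature-sampled LLM call (assumed here).
from collections import Counter
from typing import Callable, Sequence

def consistency_confidence(sample_answer: Callable[[str], str],
                           question: str, n_samples: int = 20) -> tuple[str, float]:
    """Sample n answers; confidence = share of samples agreeing with the majority."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n_samples

def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool], n_bins: int = 10) -> float:
    """Binned ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece
```

Pairing the two gives the usual evaluation loop: collect `(confidence, correct)` pairs over a QA dataset with `consistency_confidence`, then report the single ECE number that most of the papers above are trying to drive down.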