Probabilistic Medical Predictions of Large Language Models
- URL: http://arxiv.org/abs/2408.11316v2
- Date: Tue, 03 Dec 2024 21:54:39 GMT
- Title: Probabilistic Medical Predictions of Large Language Models
- Authors: Bowen Gu, Rishi J. Desai, Kueiyu Joshua Lin, Jie Yang
- Abstract summary: Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering.
LLMs struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making.
We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token.
- Score: 4.825666689707888
- Abstract: Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering, allowing flexible clinical predictions. However, they struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making. While explicit prompts can lead LLMs to generate probability estimates, their numerical reasoning limitations raise concerns about reliability. We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token. Across six advanced open-source LLMs and five medical datasets, explicit probabilities consistently underperformed implicit probabilities in discrimination, precision, and recall. This discrepancy is more pronounced with smaller LLMs and imbalanced datasets, highlighting the need for cautious interpretation, improved probability estimation methods, and further research for clinical use of LLMs.
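To make the explicit-versus-implicit contrast concrete, below is a minimal sketch of the implicit approach: reading the probability off the model's next-token distribution for the label tokens instead of asking the model to write a number. The model name, prompt, and renormalization over the two label tokens are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice; the paper evaluates six open-source LLMs, not necessarily this one.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

# Placeholder clinical prompt; the note text is elided.
prompt = "Clinical note: <note>\nQuestion: Will the patient be readmitted? Answer Yes or No.\nAnswer:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the answer position

probs = logits.softmax(dim=-1)
yes_id = tok.encode(" Yes", add_special_tokens=False)[0]
no_id = tok.encode(" No", add_special_tokens=False)[0]

# Implicit probability: likelihood of the label token, renormalized over the two labels.
p_yes_implicit = (probs[yes_id] / (probs[yes_id] + probs[no_id])).item()
print(f"implicit P(Yes) = {p_yes_implicit:.3f}")

# The explicit alternative would instead prompt the model to *write* a number
# (e.g., "Give the probability of Yes as a decimal") and parse the generated text,
# which is the strategy the abstract reports to be less reliable.
```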
Related papers
- Uncertainty Quantification for Clinical Outcome Predictions with (Large) Language Models [10.895429855778747]
We consider uncertainty quantification of language models (LMs) for electronic health record (EHR) tasks in both white-box and black-box settings.
We show that the proposed multi-tasking and ensemble methods effectively reduce model uncertainty in EHR tasks.
We validate our framework using longitudinal clinical data from more than 6,000 patients in ten clinical prediction tasks.
arXiv Detail & Related papers (2024-11-05T20:20:15Z) - Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection.
We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z) - LLMs for clinical risk prediction [0.0]
GPT-4 exhibited significant deficiencies in identifying positive cases and struggled to provide reliable probability estimates for delirium risk.
Clinalytix Medical AI demonstrated superior accuracy.
arXiv Detail & Related papers (2024-09-16T11:34:40Z) - Calibrated Large Language Models for Binary Question Answering [49.1574468325115]
A well-calibrated model should produce probabilities that accurately reflect the likelihood of its predictions being correct.
We propose a novel approach that uses the inductive Venn-Abers predictor (IVAP) to calibrate the probabilities associated with the output tokens corresponding to the binary labels (a minimal IVAP sketch appears after this list).
arXiv Detail & Related papers (2024-07-01T09:31:03Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models [52.46248487458641]
Predictive models often need to work with incomplete information in real-world tasks.
Current large language models (LLMs) fall short of the accurate probability estimation that such tasks require.
We propose BIRD, a novel probabilistic inference framework.
arXiv Detail & Related papers (2024-04-18T20:17:23Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the ambiguous input, feeds them into an LLM, and ensembles the corresponding predictions (sketched after this list).
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant share of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) only in aggregate (a minimal perplexity computation is sketched after this list).
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
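For the binary-question-answering entry above, here is a minimal sketch of an inductive Venn-Abers predictor (IVAP), assuming scikit-learn and a held-out calibration split of raw model scores. The refit-per-test-point loop is the simple, inefficient formulation, and the toy scores and the interval-merging rule are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def ivap(cal_scores, cal_labels, test_score):
    """Inductive Venn-Abers predictor for one test score.
    Returns (p0, p1), a calibrated probability interval for class 1."""
    def iso_at(extra_label):
        # Fit isotonic regression on the calibration set plus the test point
        # hypothetically labeled extra_label, then read off its fitted value.
        s = np.append(cal_scores, test_score)
        y = np.append(cal_labels, extra_label)
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(s, y)
        return float(iso.predict([test_score])[0])
    return iso_at(0), iso_at(1)

# Toy calibration split of raw "Yes" scores (illustrative numbers only).
cal_scores = np.array([0.10, 0.30, 0.35, 0.60, 0.80, 0.90])
cal_labels = np.array([0, 0, 1, 0, 1, 1])
p0, p1 = ivap(cal_scores, cal_labels, test_score=0.70)
p = p1 / (1.0 - p0 + p1)  # one common way to merge the interval into a point estimate
print(f"interval = [{p0:.3f}, {p1:.3f}], merged p = {p:.3f}")
```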
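For the input-clarification-ensembling entry above, here is a minimal sketch of the ensemble-and-decompose step. The two LLM calls are toy stand-ins, and the entropy decomposition shown is the standard mutual-information split; it is an assumption about how the paper's decomposition could be implemented, not its exact method.

```python
import numpy as np

# Toy stand-ins for the LLM calls (assumptions, not APIs from the paper): in practice,
# generate_clarifications would prompt an LLM to rewrite the ambiguous input into
# unambiguous variants, and predict_distribution would return the model's class
# distribution for one variant.
def generate_clarifications(x: str, k: int) -> list[str]:
    return [f"{x} (clarification {i})" for i in range(k)]

def predict_distribution(x: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(x)) % (2**32))
    return rng.dirichlet(np.ones(2))  # fake binary predictive distribution

def clarification_ensemble(x: str, k: int = 5):
    P = np.stack([predict_distribution(c) for c in generate_clarifications(x, k)])
    mean_p = P.mean(axis=0)                                  # ensembled prediction
    total = -(mean_p * np.log(mean_p + 1e-12)).sum()         # total predictive entropy
    model_unc = -(P * np.log(P + 1e-12)).sum(axis=1).mean()  # mean per-clarification entropy
    input_unc = total - model_unc  # uncertainty attributable to input ambiguity
    return mean_p, input_unc, model_unc

print(clarification_ensemble("Does the note indicate infection?"))
```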
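Finally, for the distributional-distortion entry, here is a minimal sketch of the aggregate perplexity metric that the paper argues can hide tail behavior. The model is a small placeholder chosen so the sketch runs quickly, not one evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a placeholder chosen for speed, not a model from the paper.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The patient was discharged in stable condition."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    nll = model(ids, labels=ids).loss  # mean next-token negative log-likelihood

ppl = torch.exp(nll)  # perplexity = exp(average NLL); an aggregate that averages over rare events
print(f"perplexity = {ppl.item():.2f}")
```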
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.