The Evaluation Gap in Medicine, AI and LLMs: Navigating Elusive Ground Truth & Uncertainty via a Probabilistic Paradigm
- URL: http://arxiv.org/abs/2601.05500v1
- Date: Fri, 09 Jan 2026 03:19:37 GMT
- Title: The Evaluation Gap in Medicine, AI and LLMs: Navigating Elusive Ground Truth & Uncertainty via a Probabilistic Paradigm
- Authors: Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth
- Abstract summary: We introduce a probabilistic paradigm to theoretically explain how high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores. We thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability.
- Score: 49.287792149338976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is particularly consequential in medicine where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to theoretically explain how high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas in datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Therefore, ignoring uncertainty in ground truth evaluation data can result in the misleading conclusion that a non-expert has similar performance to that of an expert. Using the probabilistic paradigm, we thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability. Our work leads to the recommendation that when establishing the capability of a system, results should be stratified by probability of the ground truth answer, typically measured by the agreement rate of ground truth experts. Stratification becomes critical when the overall performance drops below a threshold of 80%. Under stratified evaluation, performance comparison becomes more reliable in high certainty bins, mitigating the effect of the key confounding factor -- uncertainty.
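The abstract's central claim can be illustrated with a minimal sketch (my own reconstruction of the "expected accuracy" idea, not the authors' code): if the reference label for an item is itself drawn from a distribution over classes, the expected accuracy of any fixed prediction is just the probability mass on the predicted class, so an idealized expert and a random labeller converge as ground-truth certainty drops.

```python
# Hedged sketch of expected accuracy under ground-truth variability.
# Assumption: the reference label for each item is sampled from a
# per-item class distribution `label_dist` (e.g., expert agreement rates).

def expected_accuracy(label_dist, prediction):
    """Probability that `prediction` matches a reference label drawn
    from `label_dist` (a list of class probabilities summing to 1)."""
    return label_dist[prediction]

def expert_prediction(label_dist):
    # An idealized expert predicts the modal (most probable) label.
    return max(range(len(label_dist)), key=lambda c: label_dist[c])

def random_expected_accuracy(label_dist):
    # A uniform random labeller matches the sampled reference with
    # probability 1/K regardless of the label distribution.
    return 1.0 / len(label_dist)

high_certainty = [0.95, 0.05]   # ground-truth experts almost always agree
high_variation = [0.55, 0.45]   # ground-truth experts often disagree

for p in (high_certainty, high_variation):
    expert = expected_accuracy(p, expert_prediction(p))
    rand = random_expected_accuracy(p)
    print(p, "expert:", expert, "random:", rand)
```

Under high certainty the expert's expected accuracy (0.95) far exceeds random (0.5); under high variation the gap shrinks to 0.55 vs. 0.5, which is why the paper recommends stratifying results by the agreement rate of the ground-truth annotators.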
Related papers
- Towards Reliable LLM-based Robot Planning via Combined Uncertainty Estimation [68.106428321492]
Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. We present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately.
arXiv Detail & Related papers (2025-10-09T10:26:58Z) - Cross-World Assumption and Refining Prediction Intervals for Individual Treatment Effects [6.083038976289835]
For high-stakes decision-making, individual treatment effect estimates must be accompanied by valid prediction intervals.
arXiv Detail & Related papers (2025-07-16T18:58:18Z) - Improving Counterfactual Truthfulness for Molecular Property Prediction through Uncertainty Quantification [0.6144680854063939]
XAI interventions aim to improve interpretability for complex black-box models. In molecular property prediction, counterfactual explanations offer a way to understand predictive behavior. We propose the integration of uncertainty estimation techniques to filter counterfactual candidates with high predicted uncertainty.
arXiv Detail & Related papers (2025-04-03T14:07:30Z) - Probabilistic Modeling of Disparity Uncertainty for Robust and Efficient Stereo Matching [61.73532883992135]
We propose a new uncertainty-aware stereo matching framework. We adopt Bayes risk as the measurement of uncertainty and use it to separately estimate data and model uncertainty.
arXiv Detail & Related papers (2024-12-24T23:28:20Z) - FairlyUncertain: A Comprehensive Benchmark of Uncertainty in Algorithmic Fairness [4.14360329494344]
We introduce FairlyUncertain, an axiomatic benchmark for evaluating uncertainty estimates in fairness.
Our benchmark posits that fair predictive uncertainty estimates should be consistent across learning pipelines and calibrated to observed randomness.
arXiv Detail & Related papers (2024-10-02T20:15:29Z) - Auditing Fairness under Unobserved Confounding [56.61738581796362]
We show that, surprisingly, one can still compute meaningful bounds on treatment rates for high-risk individuals. We use the fact that in many real-world settings we have data from prior to any allocation to derive unbiased estimates of risk.
arXiv Detail & Related papers (2024-03-18T21:09:06Z) - Evaluating AI systems under uncertain ground truth: a case study in dermatology [43.8328264420381]
We show that ignoring uncertainty leads to overly optimistic estimates of model performance. In skin condition classification, we find that a large portion of the dataset exhibits significant ground truth uncertainty.
arXiv Detail & Related papers (2023-07-05T10:33:45Z) - Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts [73.33395097728128]
We provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation.
We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions.
arXiv Detail & Related papers (2023-04-19T17:38:42Z) - Fairness through Aleatoric Uncertainty [18.95295731419523]
We introduce the idea of leveraging aleatoric uncertainty (e.g., data ambiguity) to improve the fairness-utility trade-off.
Our central hypothesis is that aleatoric uncertainty is a key factor for algorithmic fairness.
We then propose a principled model to improve fairness when aleatoric uncertainty is high and improve utility elsewhere.
arXiv Detail & Related papers (2023-04-07T13:50:57Z) - Discriminative Jackknife: Quantifying Uncertainty in Deep Learning via Higher-Order Influence Functions [121.10450359856242]
We develop a frequentist procedure that utilizes influence functions of a model's loss functional to construct a jackknife (or leave-one-out) estimator of predictive confidence intervals.
The resulting Discriminative Jackknife (DJ) satisfies (1) and (2), is applicable to a wide range of deep learning models, is easy to implement, and can be applied post hoc without interfering with model training or compromising its accuracy.
arXiv Detail & Related papers (2020-06-29T13:36:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.