Predicting LLM Correctness in Prosthodontics Using Metadata and Hallucination Signals
- URL: http://arxiv.org/abs/2512.22508v1
- Date: Sat, 27 Dec 2025 07:51:50 GMT
- Title: Predicting LLM Correctness in Prosthodontics Using Metadata and Hallucination Signals
- Authors: Lucky Susanto, Anasta Pranawijayana, Cortino Sukotjo, Soni Prasad, Derry Wijaya
- Abstract summary: Large language models (LLMs) are increasingly adopted in high-stakes domains such as healthcare and medical education. This study investigates the feasibility of predicting correctness by analyzing a general-purpose model (GPT-4o) and a reasoning-centric model (OSS-120B) on a multiple-choice prosthodontics exam. Our findings demonstrate that this metadata-based approach can improve accuracy by up to +7.14% over a baseline that assumes all answers are correct, while achieving a precision of 83.12%.
- Score: 4.680384065487852
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly adopted in high-stakes domains such as healthcare and medical education, where the risk of generating factually incorrect (i.e., hallucinated) information is a major concern. While significant efforts have been made to detect and mitigate such hallucinations, predicting whether an LLM's response is correct remains a critical yet underexplored problem. This study investigates the feasibility of predicting correctness by analyzing a general-purpose model (GPT-4o) and a reasoning-centric model (OSS-120B) on a multiple-choice prosthodontics exam. We utilize metadata and hallucination signals across three distinct prompting strategies to build a correctness predictor for each (model, prompting) pair. Our findings demonstrate that this metadata-based approach can improve accuracy by up to +7.14% over a baseline that assumes all answers are correct, while achieving a precision of 83.12%. We further show that while actual hallucination is a strong indicator of incorrectness, metadata signals alone are not reliable predictors of hallucination. Finally, we reveal that prompting strategies, despite not affecting overall accuracy, significantly alter the models' internal behaviors and the predictive utility of their metadata. These results present a promising direction for developing reliability signals in LLMs but also highlight that the methods explored in this paper are not yet robust enough for critical, high-stakes deployment.
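As a rough illustration of the approach described in the abstract, the sketch below trains a lightweight correctness predictor on per-response metadata for one (model, prompting) pair and compares it against the all-correct baseline. The feature names (mean token log-probability, response length, self-consistency agreement, a binary hallucination flag) and the choice of logistic regression are illustrative assumptions; the paper's exact metadata signals and classifier are not reproduced here.

```python
# Minimal sketch of a metadata-based correctness predictor.
# Feature set and classifier are illustrative assumptions, not the paper's exact pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score


def featurize(record):
    """Map one (question, model response) record to a metadata feature vector.

    Hypothetical fields: mean token log-probability, response length in tokens,
    agreement rate across resampled answers, and a binary hallucination flag
    produced by an external detector.
    """
    return [
        record["mean_token_logprob"],
        record["response_length"],
        record["self_consistency"],
        float(record["hallucination_flag"]),
    ]


def train_correctness_predictor(records, labels):
    """Fit one predictor for a single (model, prompting strategy) pair."""
    X = np.array([featurize(r) for r in records])
    y = np.array(labels)  # 1 = answer matched the exam key, 0 = incorrect
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    preds = clf.predict(X_te)
    baseline = np.ones_like(y_te)  # baseline that assumes every answer is correct
    print("baseline accuracy: ", accuracy_score(y_te, baseline))
    print("predictor accuracy:", accuracy_score(y_te, preds))
    print("predictor precision:", precision_score(y_te, preds))
    return clf
```

Because the paper builds one predictor per (model, prompting) pair, a routine like this would be run separately for each combination of GPT-4o or OSS-120B with each of the three prompting strategies.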
Related papers
- Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs [0.0]
We examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs (a minimal sketch of this triple-structuring idea appears after this list). This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.
arXiv Detail & Related papers (2025-12-29T15:41:13Z)
- Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning [32.32593439144886]
Behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification. Our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation.
arXiv Detail & Related papers (2025-12-22T22:51:48Z)
- Revisiting Multivariate Time Series Forecasting with Missing Values [65.30332997607141]
Missing values are common in real-world time series. Current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. This framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle.
arXiv Detail & Related papers (2025-09-27T20:57:48Z)
- Shaking to Reveal: Perturbation-Based Detection of LLM Hallucinations [25.18901449626428]
A widely adopted strategy to detect hallucination, known as self-assessment, relies on the model's own output confidence to estimate the factual accuracy of its answers. We propose Sample-Specific Prompting (SSP), a new framework that improves self-assessment by analyzing perturbation sensitivity at intermediate representations. SSP significantly outperforms prior methods across a range of hallucination detection benchmarks.
arXiv Detail & Related papers (2025-06-03T09:44:28Z)
- HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses [0.12499537119440242]
This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx. The proposed model provides a novel approach that integrates detection with explanations, enabling both users and the LLM itself to understand and reduce errors.
arXiv Detail & Related papers (2025-02-12T04:17:02Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training [17.347067827911406]
We investigate the relationship between the uncertainty in training dynamics and the emergence of hallucinations. We propose Sensitivity Dropout (SenD), a novel training protocol designed to reduce hallucination variance during training. SenD improves test-time reliability of Pythia and Meta's Llama models by up to 17% and enhances factual accuracy in Wikipedia, Medical, Legal, and Coding domains.
arXiv Detail & Related papers (2024-10-20T18:18:23Z)
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms the best state-of-the-art baseline by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- Unlabelled Data Improves Bayesian Uncertainty Calibration under Covariate Shift [100.52588638477862]
We develop an approximate Bayesian inference scheme based on posterior regularisation.
We demonstrate the utility of our method in the context of transferring prognostic models of prostate cancer across globally diverse populations.
arXiv Detail & Related papers (2020-06-26T13:50:19Z)
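For the knowledge-graph self-detection entry above ("Lie to Me"), the following minimal sketch shows the general triple-structuring idea: decompose a response into atomic (subject, relation, object) facts and have the model verify each fact in isolation. The extraction and verification helpers, the prompt wording, and the any-fact-fails flagging rule are illustrative assumptions, not that paper's actual pipeline.

```python
# Minimal sketch: structure a response as atomic (subject, relation, object) triples
# and self-verify each triple independently. Helpers and prompt wording are
# illustrative assumptions, not the cited paper's method.
from dataclasses import dataclass


@dataclass
class Triple:
    subject: str
    relation: str
    obj: str


def extract_triples(response_text: str) -> list[Triple]:
    """Placeholder for an LLM call that decomposes a response into atomic facts.

    In practice this would prompt the model to emit triples; here a hard-coded
    example stands in for illustration.
    """
    return [Triple("zirconia crowns", "require", "less tooth reduction than PFM crowns")]


def verify_triple(triple: Triple, ask_model) -> bool:
    """Ask the model whether one atomic fact is true (hypothetical helper).

    `ask_model` is any callable mapping a prompt string to a short answer string.
    """
    prompt = (
        f"Is the following statement true? "
        f"{triple.subject} {triple.relation} {triple.obj}. Answer yes or no."
    )
    return ask_model(prompt).strip().lower().startswith("yes")


def flag_hallucination(response_text: str, ask_model) -> bool:
    """Flag the response if any atomic triple fails self-verification."""
    return any(not verify_triple(t, ask_model) for t in extract_triples(response_text))
```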