A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
- URL: http://arxiv.org/abs/2510.19139v2
- Date: Sun, 26 Oct 2025 01:38:30 GMT
- Title: A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
- Authors: Sohyeon Jeon, Hyung-Chul Lee,
- Abstract summary: This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset.<n>We analyze both cognitive adaptation and calibration error using metrics: Expected Error (ECE) and a baseline-normalized Relative Error (RCE)<n>Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions.
- Score: 1.1731001328350983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge. In particular, uncertainty calibration and metacognitive reliability of LLM reasoning are poorly understood and underexplored in medical automation. This study applies a behavioral and metacognitive analytic approach using an expert-validated dataset, systematically comparing two representative LLMs - one general and one domain-specialized - across three prompt strategies. We analyze both cognitive adaptation and calibration error using metrics: Expected Calibration Error (ECE) and a baseline-normalized Relative Calibration Error (RCE) that enables reliable cross-model comparison. Our results reveal pronounced miscalibration and overconfidence in both models, especially under clinical role-playing conditions, with calibration error persisting above clinically relevant thresholds. These findings underscore the need for improved calibration, transparent code, and strategic prompt engineering to develop reliable and explainable medical AI.
Related papers
- Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification [60.18369393468405]
Existing verifiers usually underperform owing to a lack of domain knowledge and limited calibration.<n>GLEAN compiles expert-curated protocols into trajectory-informed, well-calibrated correctness signals.<n>We empirically validate GLEAN with agentic clinical diagnosis across three diseases from the MIMIC-IV dataset.
arXiv Detail & Related papers (2026-03-03T09:36:43Z) - Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations.<n>Our benchmark unifies three types of medical data for open-ended diagnostic generation.<n>We present MedConf, an evidence-grounded linguistic self-assessment framework.
arXiv Detail & Related papers (2026-01-22T04:51:39Z) - ErrEval: Error-Aware Evaluation for Question Generation through Explicit Diagnostics [30.569255227942634]
We propose ErrEval, a flexible and Error-aware Evaluation framework that enhances QG evaluation through explicit error diagnostics.<n>ErrEval reformulates evaluation as a two-stage process of error diagnosis followed by informed scoring.
arXiv Detail & Related papers (2026-01-15T13:57:15Z) - Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation.<n>Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation [3.952186976672079]
We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods.<n>To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component.
arXiv Detail & Related papers (2025-10-08T23:50:58Z) - RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis [56.373297358647655]
Retrieval-Augmented Diagnosis (RAD) is a novel framework that injects external knowledge into multimodal models directly on downstream tasks.<n>RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss transformer, and a dual decoder.
arXiv Detail & Related papers (2025-09-24T10:36:14Z) - How to Evaluate Medical AI [4.23552814358972]
We introduce Relative Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD)<n>RPAD and RRAD compare AI outputs against multiple expert opinions rather than a single reference.<n>Large-scale study shows that top-performing models, such as DeepSeek-V3, achieve consistency on par with or exceeding expert consensus.
arXiv Detail & Related papers (2025-09-15T14:01:22Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning.<n>This paper provides the first systematic review of this emerging field.<n>We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery [12.134852642539407]
We propose II-KEA, a knowledge-enhanced agent-driven causal discovery framework.<n> II-KEA enhances interpretability through explicit reasoning and causal analysis.<n>It also improves interactivity by allowing clinicians to inject their knowledge and experience through customized knowledge bases and prompts.
arXiv Detail & Related papers (2025-05-22T06:36:30Z) - Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges [34.10494503049667]
Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems.<n>There is a lack of standardized evaluation criteria to assess their effectiveness.<n>We propose a comprehensive evaluation framework specifically designed for such systems.
arXiv Detail & Related papers (2025-03-11T11:05:42Z) - Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care [39.58317527488534]
In clinical practice, decision-making relies heavily on established protocols, often formalised as rules.
Despite the growing number of Machine Learning applications, their adoption into clinical practice remains limited.
We propose metrics to assess the accuracy of ML models with respect to the established protocol.
arXiv Detail & Related papers (2024-11-05T13:50:09Z) - Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language models (LLMs) reasoning.<n>Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z) - Methodological Explainability Evaluation of an Interpretable Deep Learning Model for Post-Hepatectomy Liver Failure Prediction Incorporating Counterfactual Explanations and Layerwise Relevance Propagation: A Prospective In Silico Trial [13.171582596404313]
We developed a variational autoencoder-multilayer perceptron (VAE-MLP) model for preoperative PHLF prediction.
This model integrated counterfactuals and layerwise relevance propagation (LRP) to provide insights into its decision-making mechanism.
Results from the three-track in silico clinical trial showed that clinicians' prediction accuracy and confidence increased when AI explanations were provided.
arXiv Detail & Related papers (2024-08-07T13:47:32Z) - Unified Uncertainty Estimation for Cognitive Diagnosis Models [70.46998436898205]
We propose a unified uncertainty estimation approach for a wide range of cognitive diagnosis models.
We decompose the uncertainty of diagnostic parameters into data aspect and model aspect.
Our method is effective and can provide useful insights into the uncertainty of cognitive diagnosis.
arXiv Detail & Related papers (2024-03-09T13:48:20Z) - Informing clinical assessment by contextualizing post-hoc explanations
of risk prediction models in type-2 diabetes [50.8044927215346]
We consider a comorbidity risk prediction scenario and focus on contexts regarding the patients clinical state.
We employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability.
Our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.
arXiv Detail & Related papers (2023-02-11T18:07:11Z) - VBridge: Connecting the Dots Between Features, Explanations, and Data
for Healthcare Models [85.4333256782337]
VBridge is a visual analytics tool that seamlessly incorporates machine learning explanations into clinicians' decision-making workflow.
We identified three key challenges, including clinicians' unfamiliarity with ML features, lack of contextual information, and the need for cohort-level evidence.
We demonstrated the effectiveness of VBridge through two case studies and expert interviews with four clinicians.
arXiv Detail & Related papers (2021-08-04T17:34:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.