Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
        - URL: http://arxiv.org/abs/2503.05505v2
 - Date: Thu, 08 May 2025 16:52:55 GMT
 - Authors: Yusong Ke, Hongru Lin, Yuting Ruan, Junya Tang, Li Li
 - Abstract summary: Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. This paper proposes an enhanced Conformal Prediction framework for medical multiple-choice question-answering tasks.
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. However, LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. Conformal Prediction (CP) provides a statistically rigorous framework for marginal (average) coverage guarantees but remains underexplored in medical QA. This paper proposes an enhanced CP framework for medical multiple-choice question-answering (MCQA) tasks. By associating the nonconformity score with the frequency score of correct options and leveraging self-consistency, the framework addresses internal model opacity and incorporates a risk control strategy with a monotonic loss function. Evaluated on the MedMCQA, MedQA, and MMLU datasets using four off-the-shelf LLMs, the proposed method meets the specified error rate guarantees while reducing the average prediction set size as the risk level increases, offering a promising uncertainty evaluation metric for LLMs.
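A minimal sketch of how such a frequency-based CP procedure can work in practice (an illustration under stated assumptions, not the paper's exact method): sample the model several times per question, treat option frequencies as confidence scores, calibrate a threshold by split conformal prediction, and return every option whose score clears it. The helper `sample_answers(question, m)` is hypothetical.

```python
# Sketch of split conformal prediction for MCQA with a frequency-based
# nonconformity score. `sample_answers(question, m)` is a hypothetical
# helper that returns m option letters sampled from the LLM.
import math
from collections import Counter

def option_frequencies(question, options, m=20):
    """Estimate per-option frequencies via self-consistency sampling."""
    counts = Counter(sample_answers(question, m))
    return {o: counts.get(o, 0) / m for o in options}

def calibrate_threshold(calib_items, alpha=0.1, m=20):
    """Nonconformity of the true option is 1 - its frequency; return the
    conformal quantile that yields 1 - alpha marginal coverage."""
    scores = sorted(1.0 - option_frequencies(q, opts, m)[true_opt]
                    for q, opts, true_opt in calib_items)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha)) - 1  # 0-indexed rank
    return scores[min(k, n - 1)]

def prediction_set(question, options, q_hat, m=20):
    """All options whose nonconformity does not exceed the threshold."""
    freqs = option_frequencies(question, options, m)
    return [o for o in options if 1.0 - freqs[o] <= q_hat]
```

With alpha = 0.1, the calibrated set contains the correct option for at least roughly 90% of test questions on average (the marginal coverage guarantee), and raising the allowed risk shrinks the sets.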
 
       
      
        Related papers
        - Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees
We propose a frequency-based uncertainty quantification method under black-box settings. Our approach involves multiple independent samplings of the model's output distribution for each input. We show that frequency-based predictive entropy (PE) outperforms logit-based PE in distinguishing between correct and incorrect predictions.
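A minimal sketch of the frequency-based PE idea under black-box access (assumption: `samples` is a list of option labels from repeated, independent queries of the model):

```python
# Frequency-based predictive entropy: sample the model's answer several
# times and compute the entropy of the empirical option distribution.
import math
from collections import Counter

def frequency_pe(samples):
    n = len(samples)
    probs = [c / n for c in Counter(samples).values()]
    return -sum(p * math.log(p) for p in probs)

# Low entropy (answers concentrate on one option) suggests a likely
# correct prediction; high entropy flags questions to defer or inspect.
```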
arXiv Detail & Related papers (2025-08-07T16:22:49Z)
- Conformal Information Pursuit for Interactively Guiding Large Language Models
This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or, equivalently, minimizes uncertainty. We propose Conformal Information Pursuit (C-IP), an alternative approach to sequential information gain based on conformal prediction sets.
arXiv Detail & Related papers (2025-07-04T03:55:39Z)
- COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
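A minimal sketch in the spirit of COIN's calibration step (the paper's exact procedure may differ; the Clopper-Pearson bound used here is one standard confidence-interval choice): pick the lowest confidence threshold whose high-probability upper bound on the retained error rate stays below the target risk.

```python
# Threshold calibration for selective QA. `calib` is a list of
# (confidence, is_correct) pairs from a held-out calibration set.
from scipy.stats import beta

def cp_upper_bound(errors, n, delta=0.05):
    """1 - delta Clopper-Pearson upper bound on a binomial error rate."""
    if errors == n:
        return 1.0
    return beta.ppf(1 - delta, errors + 1, n - errors)

def calibrate(calib, alpha=0.1, delta=0.05):
    for t in sorted({c for c, _ in calib}):
        kept = [ok for c, ok in calib if c >= t]
        if kept and cp_upper_bound(sum(not ok for ok in kept),
                                   len(kept), delta) <= alpha:
            return t  # answer only when confidence >= t
    return None  # abstain everywhere: target risk not attainable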
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
- Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction
We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification.
We show that the framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains.
This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
arXiv Detail & Related papers (2025-04-24T15:39:46Z)
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, an evaluation framework specifically designed for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
- PredictaBoard: Benchmarking LLM Score Predictability
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs
Conformal prediction (CP) is a model-agnostic framework for distribution-free uncertainty quantification. We introduce CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. We also propose conformal revision of questions (CROQ), which revises the problem by narrowing the available choices down to those in the prediction set.
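A minimal sketch of the CROQ step under these definitions (hypothetical helpers: the `prediction_set` function sketched earlier and an `ask_llm(question, choices=...)` call):

```python
# Conformal revision of questions: build a conformal set over the options,
# then re-ask the model with only those options remaining.
def croq(question, options, q_hat):
    kept = prediction_set(question, options, q_hat)  # conformal set
    if len(kept) == 1:          # the set already pins down the answer
        return kept[0]
    # Revised question: distractors outside the set are removed, which
    # (as in the Monty Hall problem) concentrates probability mass on
    # the surviving choices.
    return ask_llm(question, choices=kept)
```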
arXiv Detail & Related papers (2024-12-31T17:33:12Z)
- Distribution-Free Uncertainty Quantification in Mechanical Ventilation Treatment: A Conformal Deep Q-Learning Framework
This study introduces ConformalDQN, a distribution-free conformal deep Q-learning approach for optimizing mechanical ventilation in intensive care units. We trained and evaluated our model using ICU patient records from the MIMIC-IV database.
arXiv Detail & Related papers (2024-12-17T06:55:20Z)
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- Evaluating language models as risk scores
We introduce folktexts, a software package to generate risk scores using question-answering LLMs.
We evaluate 17 recent LLMs across five proposed benchmark tasks.
We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated.
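A minimal sketch of the kind of calibration check behind such a finding: expected calibration error (ECE) over binned confidence scores (the standard definition, not folktexts' API).

```python
# ECE: average gap between confidence and accuracy, weighted by bin size.
# `preds` is a list of (confidence, is_correct) pairs from MCQA outputs.
def ece(preds, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, ok in preds:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(preds)
    err = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            err += (len(b) / total) * abs(avg_conf - accuracy)
    return err
```

A high ECE with high predictive signal matches the reported pattern: the scores rank examples well but their absolute values cannot be read as probabilities.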
arXiv Detail & Related papers (2024-07-19T18:13:37Z)
- ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees
We introduce a novel uncertainty measure based on self-consistency theory.
We then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm.
 Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE).
QPP-GenRE decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query.
This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
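A minimal sketch of that idea for one measure (hypothetical `judge_relevance(query, doc)` returning a boolean LLM judgment):

```python
# Precision@k computed from LLM-generated relevance pseudo-labels.
def predicted_precision_at_k(query, ranked_docs, k=10):
    judgments = [judge_relevance(query, d) for d in ranked_docs[:k]]
    return sum(judgments) / len(judgments)
```

Because the pseudo-labels play the role of human judgments, the same pattern extends to any IR evaluation measure computable from per-item relevance.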
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
- Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond
This paper introduces Word-Sequence Entropy (WSE), a method that calibrates uncertainty at both the word and sequence levels.
We compare WSE with six baseline methods on five free-form medical QA datasets, utilizing seven popular large language models (LLMs).
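A minimal sketch in the spirit of WSE (the uniform weights here are an assumption; the method itself calibrates word-level weights, which this sketch does not reproduce):

```python
# Aggregate per-token entropies at the word level, then at the sequence
# level. `word_token_entropies` is a list of lists: one list of per-token
# entropies for each word in the generated answer.
def word_sequence_entropy(word_token_entropies):
    word_entropies = [sum(ts) / len(ts) for ts in word_token_entropies]
    return sum(word_entropies) / len(word_entropies)  # sequence level
```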
arXiv Detail & Related papers (2024-02-22T03:46:08Z)
- Self-Evaluation Improves Selective Generation in Large Language Models
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
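A minimal sketch of self-evaluation as token-level prediction (hypothetical `logprob_of_token(prompt, token)` API; the P(True) reading follows the general self-evaluation literature rather than this paper's exact prompt):

```python
import math

def self_eval_score(question, answer):
    """Probability the model assigns to its own answer being correct."""
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {answer}\n"
              "Is the proposed answer correct? Reply True or False.\nAnswer:")
    # `logprob_of_token` is a hypothetical API returning the log-probability
    # of the given continuation token under the model.
    return math.exp(logprob_of_token(prompt, " True"))
```

Answers scoring below a chosen threshold can then be withheld, turning open-ended generation into a selective-prediction problem.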
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Improving Trustworthiness of AI Disease Severity Rating in Medical Imaging with Ordinal Conformal Prediction Sets
A lack of statistically rigorous uncertainty quantification is a significant factor undermining trust in AI results.
Recent developments in distribution-free uncertainty quantification present practical solutions for these issues.
We demonstrate a technique for forming ordinal prediction sets that are guaranteed to contain the correct stenosis severity.
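A minimal sketch of one way to form such an ordinal set (an assumption, not the paper's exact algorithm): grow a contiguous interval of severity grades around the mode class until its mass reaches a conformally calibrated threshold `tau`.

```python
# Contiguous ordinal prediction set over ordered severity grades.
# `probs` is the model's softmax over grades; `tau` comes from conformal
# calibration so the interval covers the true grade with high probability.
def ordinal_set(probs, tau):
    lo = hi = max(range(len(probs)), key=probs.__getitem__)
    mass = probs[lo]
    while mass < tau and (lo > 0 or hi < len(probs) - 1):
        # Extend toward whichever neighbor adds more mass, keeping the
        # set contiguous so it respects the ordinal structure.
        left = probs[lo - 1] if lo > 0 else -1.0
        right = probs[hi + 1] if hi < len(probs) - 1 else -1.0
        if left >= right:
            lo -= 1
            mass += left
        else:
            hi += 1
            mass += right
    return list(range(lo, hi + 1))
```

Contiguity matters clinically: a set like {moderate, severe} is actionable, whereas a disconnected set such as {none, severe} would not be.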
arXiv Detail & Related papers (2022-07-05T18:01:20Z)
- Modeling Disagreement in Automatic Data Labelling for Semi-Supervised Learning in Clinical Natural Language Processing
We investigate the quality of uncertainty estimates from a range of current state-of-the-art predictive models applied to the problem of observation detection in radiology reports.
arXiv Detail & Related papers (2022-05-29T20:20:49Z)
- Distribution-Free Federated Learning with Conformal Predictions
Federated learning aims to leverage separate institutional datasets while maintaining patient privacy.
Poor calibration and lack of interpretability may hamper widespread deployment of federated models into clinical practice.
We propose to address these challenges by incorporating an adaptive conformal framework into federated learning.
arXiv  Detail & Related papers  (2021-10-14T18:41:17Z) 
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     