Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control
- URL: http://arxiv.org/abs/2508.10022v1
- Date: Thu, 07 Aug 2025 16:46:47 GMT
- Title: Conformal P-Value in Multiple-Choice Question Answering Tasks with Provable Risk Control
- Authors: Yuanchang Ye
- Abstract summary: This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve the trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor. This work establishes a statistical framework for trustworthy LLM deployment in high-stakes QA applications.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study introduces a significance testing-enhanced conformal prediction (CP) framework to improve the trustworthiness of large language models (LLMs) in multiple-choice question answering (MCQA). While LLMs have been increasingly deployed in disciplinary QA scenarios, hallucination and nonfactual generation substantially compromise response reliability. Although CP provides statistically rigorous marginal coverage guarantees for prediction sets, and significance testing offers established statistical rigor, their synergistic integration remains unexplored. To mitigate hallucination and factual inaccuracies, our framework integrates $p$-value computation with conformity scoring through self-consistency resampling of MCQA responses. This approach calculates option frequencies to address LLMs' black-box nature, subsequently constructing prediction sets via null hypothesis testing ($\mathcal{H}_0$) with empirically derived $p$-values. Evaluations on MMLU and MMLU-Pro benchmarks using off-the-shelf LLMs demonstrate: (1) The enhanced CP achieves user-specified empirical miscoverage rates; (2) Test-set average prediction set size (APSS) decreases monotonically with increasing risk levels ($\alpha$), validating APSS as an effective uncertainty metric. This work establishes a principled statistical framework for trustworthy LLM deployment in high-stakes QA applications.
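As a concrete illustration of the pipeline the abstract describes, here is a minimal Python sketch: option frequencies from self-consistency resampling serve as conformity scores, empirical $p$-values are ranked against calibration scores, and an option enters the prediction set whenever $\mathcal{H}_0$ cannot be rejected at level $\alpha$. The function names, the example data, and the exact score and p-value conventions are our assumptions; the paper defines its own.

```python
import numpy as np

def option_frequencies(sampled_answers, options):
    """Empirical frequency of each option among m resampled answers
    to the same prompt (self-consistency resampling)."""
    m = len(sampled_answers)
    return {o: sum(a == o for a in sampled_answers) / m for o in options}

def conformal_pvalue(cal_scores, candidate_score):
    """Empirical p-value of H0 'this candidate is the true answer',
    ranked against calibration conformity scores of true answers."""
    cal = np.asarray(cal_scores)
    return (1 + np.sum(cal <= candidate_score)) / (len(cal) + 1)

def prediction_set(cal_scores, test_freqs, alpha):
    """Keep every option whose p-value exceeds the risk level alpha;
    marginal coverage >= 1 - alpha follows from exchangeability."""
    return {y for y, f in test_freqs.items()
            if conformal_pvalue(cal_scores, f) > alpha}

# Illustrative calibration scores (frequencies of true answers) and
# one test question's option frequencies, at alpha = 0.1.
cal_scores = [0.9, 0.7, 0.8, 0.6, 0.95, 0.5, 0.85, 0.75, 0.65, 0.9]
test_freqs = option_frequencies(list("BBBBBBACBB"), options="ABCD")
print(prediction_set(cal_scores, test_freqs, alpha=0.1))  # {'B'}
```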
Related papers
- Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees [5.09580026885155]
We propose a frequency-based uncertainty quantification method under black-box settings. Our approach involves multiple independent samplings of the model's output distribution for each input. We show that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions.
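Assuming PE here stands for predictive entropy, a sketch of the frequency-based variant could look like the following; it needs no logits, consistent with the black-box setting. The function name and example samples are illustrative, not from the paper.

```python
import math
from collections import Counter

def frequency_based_pe(sampled_answers):
    """Predictive entropy computed from option frequencies over m
    independent samples of the model's answer (no logits required)."""
    m = len(sampled_answers)
    counts = Counter(sampled_answers)
    return -sum((c / m) * math.log(c / m) for c in counts.values())

# Example: 10 resampled answers to one MCQA prompt.
print(frequency_based_pe(list("AAAAAAABBC")))  # low entropy -> confident
print(frequency_based_pe(list("AABBCCDDAB")))  # high entropy -> uncertain
```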
arXiv Detail & Related papers (2025-08-07T16:22:49Z)
- COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
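The summary says "confidence interval methods" without naming one; a standard choice for a high-probability upper bound on a binomial error rate is the one-sided Clopper-Pearson interval. The sketch below assumes that choice, which may differ from the paper's.

```python
from scipy.stats import beta

def error_rate_upper_bound(k_errors, n_cal, delta=0.05):
    """One-sided Clopper-Pearson upper confidence bound on the true
    error rate, given k errors among n calibration questions."""
    if k_errors == n_cal:
        return 1.0
    return beta.ppf(1 - delta, k_errors + 1, n_cal - k_errors)

# Example: 12 wrong answers out of 500 calibration questions; the
# returned bound holds with probability at least 1 - delta = 0.95.
print(error_rate_upper_bound(12, 500))
```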
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
- Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction [0.0]
We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. We show that the framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
arXiv Detail & Related papers (2025-04-24T15:39:46Z)
- Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework [2.9599960287815144]
Large language models (LLMs) are increasingly adopted in medical question-answering (QA) scenarios. LLMs can generate hallucinations and nonfactual information, undermining their trustworthiness in high-stakes medical tasks. This paper proposes an enhanced Conformal Prediction framework for medical multiple-choice question-answering tasks.
arXiv Detail & Related papers (2025-03-07T15:22:10Z)
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- Prune 'n Predict: Optimizing LLM Decision-making with Conformal Prediction [7.843594672029363]
Incorrect outputs pose significant risks in high-stakes domains like healthcare and finance. We propose conformal revision of questions (CROQ), which revises the question by narrowing down the available choices to those in the prediction set. We also propose CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage.
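A minimal sketch of the CROQ idea as described here: re-pose the question with only the options that survive the conformal prediction set. The function name, relabeling scheme, and example question are our assumptions for illustration.

```python
def croq_revise(question, options, prediction_set):
    """Conformal revision of questions (CROQ): rebuild the prompt
    keeping only the options retained by the prediction set."""
    kept = {label: text for label, text in options.items()
            if label in prediction_set}
    relabeled = dict(zip("ABCDEFGH", kept.values()))  # fresh labels
    lines = [question] + [f"{k}. {v}" for k, v in relabeled.items()]
    return "\n".join(lines), relabeled

# Example: the prediction set pruned 4 options down to 2 candidates.
prompt, mapping = croq_revise(
    "Which gas is most abundant in Earth's atmosphere?",
    {"A": "Oxygen", "B": "Nitrogen", "C": "Argon", "D": "CO2"},
    prediction_set={"B", "C"},
)
print(prompt)  # revised question with only two choices, relabeled A/B
```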
arXiv Detail & Related papers (2024-12-31T17:33:12Z)
- Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI [47.64301863399763]
We present a dynamic semantic clustering approach inspired by the Chinese Restaurant Process. We quantify the uncertainty of Large Language Models (LLMs) on a given query by calculating the entropy of the generated semantic clusters. We propose leveraging the (negative) likelihood of these clusters as the (non)conformity score within the Conformal Prediction framework.
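A sketch of the two quantities this summary names, assuming each sampled generation has already been assigned a semantic-cluster id upstream; the CRP-inspired clustering itself is omitted, and all names and example data are illustrative.

```python
import math
from collections import Counter

def cluster_entropy(cluster_ids):
    """Entropy over the semantic clusters of m sampled generations."""
    m = len(cluster_ids)
    probs = [c / m for c in Counter(cluster_ids).values()]
    return -sum(p * math.log(p) for p in probs)

def nonconformity(cluster_ids, answer_cluster):
    """Negative empirical likelihood of the answer's cluster, used
    as the nonconformity score inside conformal prediction."""
    m = len(cluster_ids)
    return -Counter(cluster_ids)[answer_cluster] / m

print(cluster_entropy([0, 0, 0, 1, 1, 2]))   # moderate uncertainty
print(nonconformity([0, 0, 0, 1, 1, 2], 0))  # -0.5: most likely cluster
```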
arXiv Detail & Related papers (2024-11-04T18:49:46Z)
- ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory. We then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm. Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)