Related papers: Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

URL: http://arxiv.org/abs/2510.02671v2
Date: Thu, 23 Oct 2025 21:43:24 GMT
Title: Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering
Authors: Yavuz Bakman, Sungmin Kang, Zhiqi Huang, Duygu Nur Yaldiz, Catarina G. Belém, Chenyang Zhu, Anoop Kumar, Alfy Samuel, Salman Avestimehr, Daben Liu, Sai Praneeth Karimireddy,
Abstract summary: We introduce a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution.<n>We derive an upper bound for uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations.<n>We apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance, context comprehension, and honesty.
Score: 29.4458902836278
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify epistemic uncertainty. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance (using the provided context rather than parametric knowledge), context comprehension (extracting relevant information from context), and honesty (avoiding intentional lies). Using a top-down interpretability approach, we extract these features by using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring a negligible inference overhead.

Related papers

The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity [48.899855816199484]
We introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions.<n>We show that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity.
arXiv Detail & Related papers (2025-11-06T14:46:35Z)
The Consistency Hypothesis in Uncertainty Quantification for Large Language Models [22.60039074743706]
Black-box uncertainty quantification (UQ) methods, relying solely on model API access, have gained popularity due to their practical benefits.<n>In this paper, we examine the implicit assumption behind several UQ methods, which use generation consistency as a proxy for confidence.<n>We propose data-free black-box UQ methods that aggregate similarities between generations for confidence estimation.
arXiv Detail & Related papers (2025-06-27T01:53:15Z)
COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question.<n>COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate.<n>We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models [5.6672926445919165]
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ)<n>Existing UQ methods are often and lack a probabilistic foundation.<n>We propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations.
arXiv Detail & Related papers (2025-06-11T13:02:17Z)
UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model [19.8785302359805]
This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation.<n>We use this formulation to train USAM, a lightweight post-hoc UQ method.<n>Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets.
arXiv Detail & Related papers (2025-05-08T08:36:23Z)
SConU: Selective Conformal Uncertainty in Large Language Models [59.25881667640868]
We propose a novel approach termed Selective Conformal Uncertainty (SConU)<n>We develop two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level.<n>Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions.
arXiv Detail & Related papers (2025-04-19T03:01:45Z)
Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency [66.96286531087549]
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches.<n>We propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods.<n>We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation.
arXiv Detail & Related papers (2025-02-07T14:30:12Z)
Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability. In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling. Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
Distributional Shift-Aware Off-Policy Interval Estimation: A Unified Error Quantification Framework [8.572441599469597]
We study high-confidence off-policy evaluation in the context of infinite-horizon Markov decision processes. The objective is to establish a confidence interval (CI) for the target policy value using only offline data pre-collected from unknown behavior policies. We show that our algorithm is sample-efficient, error-robust, and provably convergent even in non-linear function approximation settings.
arXiv Detail & Related papers (2023-09-23T06:35:44Z)
Towards Clear Expectations for Uncertainty Estimation [64.20262246029286]
Uncertainty Quantification (UQ) is crucial to achieve trustworthy Machine Learning (ML) Most UQ methods suffer from disparate and inconsistent evaluation protocols. This opinion paper offers a new perspective by specifying those requirements through five downstream tasks.
arXiv Detail & Related papers (2022-07-27T07:50:57Z)
What is Flagged in Uncertainty Quantification? Latent Density Models for Uncertainty Categorization [68.15353480798244]
Uncertainty Quantification (UQ) is essential for creating trustworthy machine learning models. Recent years have seen a steep rise in UQ methods that can flag suspicious examples. We propose a framework for categorizing uncertain examples flagged by UQ methods in classification tasks.
arXiv Detail & Related papers (2022-07-11T19:47:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.