Language Model Uncertainty Quantification with Attention Chain
- URL: http://arxiv.org/abs/2503.19168v2
- Date: Thu, 07 Aug 2025 14:58:09 GMT
- Title: Language Model Uncertainty Quantification with Attention Chain
- Authors: Yinghao Li, Rushi Qiang, Lama Moukheiber, Chao Zhang
- Abstract summary: Large language models' (LLM) predictive uncertainty is crucial for judging the reliability of their answers. We propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs.
- Score: 9.093726246465117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately quantifying a large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. While most existing research focuses on short, directly answerable questions with closed-form outputs (e.g., multiple-choice), LLM responses increasingly involve intermediate reasoning steps. This added complexity complicates uncertainty quantification (UQ) because the probabilities assigned to answer tokens are conditioned on a vast space of preceding reasoning tokens. Direct marginalization is infeasible, and the dependency inflates probability estimates, causing overconfidence in UQ. To address this, we propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. UQAC iteratively constructs an "attention chain" of tokens deemed "semantically crucial" to the final answer via a backtracking procedure. Starting from the answer tokens, it uses attention weights to identify the most influential predecessors, then iterates this process until reaching the input tokens. The resulting chain is further refined with similarity filtering and probability thresholding, which reduce the reasoning space, facilitating the approximation of the marginal answer token probabilities. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs, demonstrating that it consistently delivers reliable UQ estimates with high computational efficiency.
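The backtracking step can be illustrated with a short sketch. The snippet below is not the authors' implementation: the function name `build_attention_chain`, the head-averaged attention matrix, and the fixed `top_k` branching factor are illustrative assumptions, and the similarity-filtering and probability-thresholding stages that UQAC applies afterwards are omitted.

```python
# Illustrative sketch of attention-chain backtracking (not the authors' code).
# Given a lower-triangular attention matrix averaged over heads and layers, start
# from the answer tokens and repeatedly follow the strongest attention edges
# backwards until the prompt tokens are reached.
import numpy as np

def build_attention_chain(attn, answer_ids, num_input_tokens, top_k=2):
    """attn[i, j]: attention from token i to an earlier token j (j < i)."""
    chain = set(answer_ids)
    frontier = list(answer_ids)
    while frontier:
        next_frontier = []
        for i in frontier:
            if i < num_input_tokens:      # stop expanding once we reach the prompt
                continue
            preds = np.argsort(attn[i, :i])[::-1][:top_k]  # most-attended predecessors
            for j in preds:
                if j not in chain:
                    chain.add(int(j))
                    next_frontier.append(int(j))
        frontier = next_frontier
    return sorted(chain)

# Toy example: 10 tokens, the first 4 are the prompt, the last 2 are the answer.
rng = np.random.default_rng(0)
T = 10
attn = np.tril(rng.random((T, T)), k=-1)
attn = attn / np.maximum(attn.sum(axis=1, keepdims=True), 1e-9)
print(build_attention_chain(attn, answer_ids=[8, 9], num_input_tokens=4))
```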
Related papers
- COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
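As a rough illustration of the calibration idea (not COIN's actual procedure), the sketch below sweeps confidence thresholds and keeps the least strict one whose upper bound on the true error rate stays below a target risk level. The one-sided Hoeffding bound, the `alpha` and `delta` values, and the function names are assumptions of this sketch; COIN's confidence-interval construction may differ.

```python
import numpy as np

def error_upper_bound(errors, delta=0.05):
    """One-sided Hoeffding bound: true error <= empirical error + margin, w.p. 1 - delta."""
    n = len(errors)
    return np.mean(errors) + np.sqrt(np.log(1.0 / delta) / (2 * n))

def calibrate_threshold(confidences, correct, alpha=0.1, delta=0.05):
    # Sweep candidate thresholds from least to most strict; an answer is kept
    # only if its confidence exceeds the threshold.
    for t in np.linspace(0.0, 1.0, 101):
        kept = confidences > t
        if kept.sum() == 0:
            return None
        errors = (~correct[kept]).astype(float)
        if error_upper_bound(errors, delta) <= alpha:
            return t          # least strict threshold meeting the risk target
    return None

# Synthetic calibration set: higher confidence -> more likely to be correct.
rng = np.random.default_rng(0)
conf = rng.random(5000)
correct = rng.random(5000) < conf
print(calibrate_threshold(conf, correct, alpha=0.2))
```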
arXiv Detail & Related papers (2025-06-25T07:04:49Z) - Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
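A toy sketch of the general idea of mixing cheap approximate scores with exact attention over a few important keys is given below. The contiguous-block centroids, the `exact_k` budget, and the function name are my own simplifications and do not reproduce the paper's multipole construction.

```python
import numpy as np

def approx_then_exact_attention(q, K, V, n_blocks=4, exact_k=8):
    T, d = K.shape
    # Cheap pass: summarize contiguous key blocks by their centroids.
    blocks = np.array_split(np.arange(T), n_blocks)
    centroids = np.stack([K[b].mean(axis=0) for b in blocks])
    block_scores = centroids @ q / np.sqrt(d)
    # Spend exact attention only on keys from the highest-scoring blocks.
    ranked_blocks = np.argsort(block_scores)[::-1]
    important = np.concatenate([blocks[i] for i in ranked_blocks])[:exact_k]
    scores = np.full(T, -np.inf)
    scores[important] = K[important] @ q / np.sqrt(d)              # exact scores
    mask = np.isneginf(scores)
    for i, b in enumerate(blocks):
        scores[b] = np.where(mask[b], block_scores[i], scores[b])  # centroid fallback
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(16), rng.standard_normal((32, 16)), rng.standard_normal((32, 16))
print(approx_then_exact_attention(q, K, V).shape)   # (16,)
```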
arXiv Detail & Related papers (2025-06-16T03:00:40Z) - Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs [114.46334319795785]
Large language models (LLMs) exhibit impressive fluency, but often produce critical errors known as "hallucinations". We propose RAUQ (Recurrent Attention-based Uncertainty Quantification), an unsupervised approach that leverages intrinsic attention patterns in transformers to detect hallucinations efficiently. Experiments across 4 LLMs and 12 question answering, summarization, and translation tasks demonstrate that RAUQ yields excellent results.
arXiv Detail & Related papers (2025-05-26T14:28:37Z) - Self-Training Elicits Concise Reasoning in Large Language Models [23.475414693530965]
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens. We propose simple fine-tuning methods which leverage self-generated concise reasoning paths. Our method achieves a 30% reduction in output tokens across five model families on GSM8K and MATH, while maintaining average accuracy.
arXiv Detail & Related papers (2025-02-27T14:14:50Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.
We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.
Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
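The sketch below illustrates Best-of-N selection with a reward-model-free confidence score. The specific score used here (mean KL divergence of each next-token distribution from uniform) is a plausible stand-in and may not match the paper's exact definition of self-certainty.

```python
import numpy as np

def self_certainty(token_dists):
    """token_dists: (T, V) array of next-token probability distributions."""
    V = token_dists.shape[1]
    # KL(p || uniform) = sum_v p_v * log(p_v * V); larger means more peaked / more certain.
    kl_from_uniform = np.sum(token_dists * np.log(token_dists * V + 1e-12), axis=1)
    return kl_from_uniform.mean()

def best_of_n(candidates):
    """candidates: list of (answer_text, token_dists) pairs; pick the most self-certain one."""
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]

rng = np.random.default_rng(0)
def fake_dists(T=20, V=50, peaked=1.0):
    logits = rng.standard_normal((T, V)) * peaked
    return np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

cands = [("answer A", fake_dists(peaked=0.5)), ("answer B", fake_dists(peaked=3.0))]
print(best_of_n(cands))   # the more peaked (more certain) candidate wins
```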
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Uncertainty Quantification in Retrieval Augmented Question Answering [57.05827081638329]
We propose to quantify the uncertainty of a QA model by estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that, while simple information-theoretic metrics can predict answer correctness to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods.
arXiv Detail & Related papers (2025-02-25T11:24:52Z) - CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought [10.166370877826486]
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, which incurs high computational costs. We propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process.
arXiv Detail & Related papers (2025-02-24T14:48:06Z) - Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency [66.96286531087549]
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches. We propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation.
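A minimal illustration of combining model confidence with output consistency is sketched below; the particular combination (sequence probability reweighted by mean pairwise similarity across samples) and all names are assumptions of this sketch, not the paper's estimator.

```python
import numpy as np

def combined_score(seq_logprobs, similarity):
    """seq_logprobs: (N,) log-probabilities of N sampled answers.
    similarity: (N, N) pairwise semantic-similarity matrix in [0, 1]."""
    conf = np.exp(seq_logprobs - seq_logprobs.max())   # relative model confidences
    consistency = similarity.mean(axis=1)               # agreement with the other samples
    scores = conf * consistency
    best = int(np.argmax(scores))
    return best, scores / scores.sum()                  # chosen answer and normalized scores

logp = np.array([-4.0, -3.5, -6.0])
sim = np.array([[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
idx, scores = combined_score(logp, sim)
print(idx, scores.round(3))   # the confident, consistent sample is selected
```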
arXiv Detail & Related papers (2025-02-07T14:30:12Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting text generated by large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the LLM's last hidden states and token positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. PAWN shows competitive and even better in-distribution performance than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Path-Consistency: Prefix Enhancement for Efficient Inference in LLM [3.309813585671485]
Path-consistency mitigates both the errors and redundancies from random or less useful sampling in self-consistency. It achieves significant acceleration in inference latency, ranging from 7.8% to 40.5%.
arXiv Detail & Related papers (2024-08-25T01:45:53Z) - Do Not Design, Learn: A Trainable Scoring Function for Uncertainty Estimation in Generative LLMs [32.672370840879616]
Learnable Response Scoring (LARS) is a novel scoring function that leverages supervised data to capture complex dependencies between tokens and probabilities.
Our experiments demonstrate that LARS significantly outperforms existing scoring functions, achieving improvements of up to 16% in AUROC.
arXiv Detail & Related papers (2024-06-17T07:30:40Z) - Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation [37.63939774027709]
We propose enhancing the predicted sequence probability by assigning different weights to various tokens.
We refer to this new score as the Contextualized Sequence Likelihood (CSL).
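The underlying idea of a token-weighted sequence likelihood can be sketched as below. The weights here are hand-picked for illustration and the helper name is hypothetical; CSL derives its token weights from the model itself.

```python
import numpy as np

def weighted_sequence_confidence(token_logprobs, weights):
    """Weighted average of token log-probabilities instead of a plain mean."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.exp(np.sum(weights * token_logprobs)))

logp = np.array([-0.1, -0.2, -2.5, -0.1])   # the third token is uncertain
uniform = weighted_sequence_confidence(logp, np.ones(4))
focused = weighted_sequence_confidence(logp, np.array([0.1, 0.1, 1.0, 0.1]))
print(round(uniform, 3), round(focused, 3))  # emphasizing the uncertain token lowers confidence
```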
arXiv Detail & Related papers (2024-06-03T21:55:07Z) - Mitigating Misleading Chain-of-Thought Reasoning with Selective Filtering [59.495717939664246]
Large language models have manifested remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions.
We propose a novel approach called the selective filtering reasoner (SelF-Reasoner) that assesses the entailment relationship between the question and the candidate reasoning chain.
SelF-Reasoner improves the fine-tuned T5 baseline consistently over the ScienceQA, ECQA, and LastLetter tasks.
arXiv Detail & Related papers (2024-03-28T06:28:35Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
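A minimal sketch of token-level screening by perplexity is shown below, using synthetic per-token log-probabilities instead of a real LM and a simple z-score rule; the paper's detector also incorporates contextual information and is more sophisticated.

```python
import numpy as np

def flag_suspicious_tokens(token_logprobs, z_thresh=2.0):
    """Flag tokens whose negative log-likelihood is anomalously high within the prompt."""
    nll = -np.asarray(token_logprobs, dtype=float)
    z = (nll - nll.mean()) / (nll.std() + 1e-9)
    return np.where(z > z_thresh)[0]          # indices of high-perplexity tokens

# A mostly fluent prompt with a few injected gibberish tokens (very low log-prob).
logp = np.array([-1.2, -0.8, -1.5, -9.7, -10.2, -1.1, -0.9, -1.3])
print(flag_suspicious_tokens(logp, z_thresh=1.0))   # -> indices 3 and 4
```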
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions [10.621564997491808]
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models.
We investigate whether CoT prompting affects the relative importance these models assign to particular input tokens.
Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt, it increases the robustness of saliency scores to question perturbations and variations in model output.
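Gradient-based input-token attribution, the analysis tool referenced above, can be sketched with a toy model as below; the tiny embedding-plus-linear stand-in and the gradient-times-input attribution are illustrative choices, not the paper's exact setup.

```python
import torch

torch.manual_seed(0)
vocab, dim = 100, 16
embed = torch.nn.Embedding(vocab, dim)   # toy stand-in for an LLM's input embeddings
head = torch.nn.Linear(dim, 1)           # toy stand-in for the model's answer score

tokens = torch.tensor([5, 17, 42, 7, 99])
emb = embed(tokens).detach().requires_grad_(True)   # treat embeddings as the input
score = head(emb.mean(dim=0))                       # scalar "answer" score
score.sum().backward()

saliency = (emb.grad * emb).sum(dim=-1).abs()       # gradient-times-input attribution
print(saliency / saliency.sum())                    # relative importance per input token
```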
arXiv Detail & Related papers (2023-07-25T08:51:30Z) - Towards Clear Expectations for Uncertainty Estimation [64.20262246029286]
Uncertainty Quantification (UQ) is crucial to achieve trustworthy Machine Learning (ML).
Most UQ methods suffer from disparate and inconsistent evaluation protocols.
This opinion paper offers a new perspective by specifying those requirements through five downstream tasks.
arXiv Detail & Related papers (2022-07-27T07:50:57Z)