Related papers: CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought

URL: http://arxiv.org/abs/2502.17214v1
Date: Mon, 24 Feb 2025 14:48:06 GMT
Title: CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought
Authors: Boxuan Zhang, Ruqi Zhang
Abstract summary: Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, which incurs high computational costs. We propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process.
Score: 10.166370877826486
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments based on LLaMA Family with model sizes varying from 8B to 13B across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods. The code is available at: https://github.com/ZBox1005/CoT-UQ.
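The aggregation step described in the abstract (keywords extracted per reasoning step, weighted by their importance to the final answer, then combined into one uncertainty estimate) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the aggregation formula, so the importance-weighted mean used here, the `Keyword` fields, and the function name `aggregate_uncertainty` are all assumptions made for illustration. The official code is at the repository linked above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Keyword:
    text: str
    probability: float  # assumed: model probability assigned to the keyword, in [0, 1]
    importance: float   # assumed: non-negative importance of the keyword to the final answer


def aggregate_uncertainty(steps: List[List[Keyword]]) -> float:
    """Combine per-step keywords into a single uncertainty score in [0, 1].

    Assumption: confidence is the importance-weighted mean of keyword
    probabilities, and uncertainty is one minus that confidence.
    """
    keywords = [kw for step in steps for kw in step]
    total_weight = sum(kw.importance for kw in keywords)
    if total_weight == 0:
        # No usable keywords: fall back to maximal uncertainty.
        return 1.0
    confidence = sum(kw.probability * kw.importance for kw in keywords) / total_weight
    return 1.0 - confidence


# Usage: two reasoning steps, one keyword each (values are made up).
steps = [
    [Keyword("perimeter", probability=0.9, importance=1.0)],
    [Keyword("24 cm", probability=0.5, importance=3.0)],
]
score = aggregate_uncertainty(steps)
```

In this sketch a low-probability keyword that matters a lot to the answer pulls the overall uncertainty up more than an equally uncertain but unimportant one, which is the intuition the abstract describes.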

Related papers

Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z)
Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering [7.1559850008795385]
Large Language Models (LLMs) are commonly used in Question Answering (QA) settings. Existing UQ approaches remain weakly validated in scientific QA. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA.
arXiv Detail & Related papers (2026-01-30T20:02:34Z)
Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models [24.97354151540176]
We conduct experiments to evaluate ArgLLMs' performance on claim verification tasks when using different UQ methods. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs.
arXiv Detail & Related papers (2025-09-26T19:59:09Z)
CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs [14.97707719362011]
We propose Cycle-Consistency in Question Answering (CCQA). Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. It is verified that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks.
arXiv Detail & Related papers (2025-09-23T02:01:03Z)
Language Model Uncertainty Quantification with Attention Chain [9.093726246465117]
A large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. We propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs.
arXiv Detail & Related papers (2025-03-24T21:43:47Z)
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models [76.17975723711886]
Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores.
arXiv Detail & Related papers (2025-02-20T10:25:13Z)
Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency [66.96286531087549]
Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches. We propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation.
arXiv Detail & Related papers (2025-02-07T14:30:12Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
LUQ: Long-text Uncertainty Quantification for LLMs [29.987010627250527]
Large Language Models (LLMs) are prone to generate nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence on its generation. We propose Luq-Ensemble, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty.
arXiv Detail & Related papers (2024-03-29T16:49:24Z)
Can multiple-choice questions really be useful in detecting the abilities of LLMs? [15.756543037102256]
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs). The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ's efficacy. We evaluate nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English.
arXiv Detail & Related papers (2024-03-26T14:43:48Z)
DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning. DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)
Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for Knowledge-intensive Question Answering [17.672572064705445]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) have shown impressive reasoning ability in various downstream tasks. We propose a framework called Knowledge-Driven Chain-of-Thought (KD-CoT) to verify and modify reasoning traces in CoT via interaction with external knowledge.
arXiv Detail & Related papers (2023-08-25T09:23:55Z)
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering [59.63860993280275]
Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks. We propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. Our approach achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%.
arXiv Detail & Related papers (2023-05-05T11:56:30Z)
Active Prompting with Chain-of-Thought for Large Language Models [26.5029080638055]
This paper proposes a new method, Active-Prompt, to adapt large language models to different tasks. By borrowing ideas from the related problem of uncertainty-based active learning, we introduce several metrics to characterize the uncertainty. Experimental results demonstrate the superiority of our proposed method, achieving state-of-the-art on eight complex reasoning tasks.
arXiv Detail & Related papers (2023-02-23T18:58:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.