A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models
- URL: http://arxiv.org/abs/2508.00600v1
- Date: Fri, 01 Aug 2025 12:58:34 GMT
- Title: A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models
- Authors: Mingruo Yuan, Shuyi Zhang, Ben Kao,
- Abstract summary: Current confidence estimation methods for large language models (LLMs) neglect the relevance between responses and contextual information.<n>We propose CRUX, which integrates context faithfulness and consistency for confidence estimation via two novel metrics.<n> Experiments across three benchmark datasets demonstrate CRUX's effectiveness, achieving the highest AUROC than existing baselines.
- Score: 6.62851757612838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate confidence estimation is essential for trustworthy large language models (LLMs) systems, as it empowers the user to determine when to trust outputs and enables reliable deployment in safety-critical applications. Current confidence estimation methods for LLMs neglect the relevance between responses and contextual information, a crucial factor in output quality evaluation, particularly in scenarios where background knowledge is provided. To bridge this gap, we propose CRUX (Context-aware entropy Reduction and Unified consistency eXamination), the first framework that integrates context faithfulness and consistency for confidence estimation via two novel metrics. First, contextual entropy reduction represents data uncertainty with the information gain through contrastive sampling with and without context. Second, unified consistency examination captures potential model uncertainty through the global consistency of the generated answers with and without context. Experiments across three benchmark datasets (CoQA, SQuAD, QuAC) and two domain-specific datasets (BioASQ, EduQG) demonstrate CRUX's effectiveness, achieving the highest AUROC than existing baselines.
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs)<n>We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z) - Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration [48.19579266939883]
Diffusion large language models (dLLMs) have attracted significant attention for their ability to enhance diversity, controllability, and parallelism.<n>We propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs.
arXiv Detail & Related papers (2026-03-03T08:58:20Z) - Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals [13.89434979851652]
Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs.<n>We present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction.
arXiv Detail & Related papers (2026-02-01T02:35:59Z) - Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation [97.36081721024728]
We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations.<n>Our benchmark unifies three types of medical data for open-ended diagnostic generation.<n>We present MedConf, an evidence-grounded linguistic self-assessment framework.
arXiv Detail & Related papers (2026-01-22T04:51:39Z) - Open-World Deepfake Attribution via Confidence-Aware Asymmetric Learning [78.92934995292113]
We propose a Confidence-Aware Asymmetric Learning (CAL) framework, which balances confidence across known and novel forgery types.<n>CAL consistently outperforms previous methods, achieving new state-of-the-art performance on both known and novel forgery attribution.
arXiv Detail & Related papers (2025-12-14T12:31:28Z) - LLM-Centric RAG with Multi-Granular Indexing and Confidence Constraints [5.2604064919135896]
This paper addresses the issues of insufficient coverage, unstable results, and limited reliability in retrieval-augmented generation under complex knowledge environments.<n>It proposes a confidence control method that integrates multi-granularity memory indexing with uncertainty estimation.<n>The results show that the method achieves superior performance over existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency.
arXiv Detail & Related papers (2025-10-30T23:48:37Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions.<n>We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation.<n>Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs.<n>Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z) - Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG [51.120170062795566]
We propose Divide-Then-Align (DTA) to endow RAG systems with the ability to respond with "I don't know" when the query is out of the knowledge boundary.<n>DTA balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
arXiv Detail & Related papers (2025-05-27T08:21:21Z) - Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction [0.0]
We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification.<n>We show that the framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains.<n>This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
arXiv Detail & Related papers (2025-04-24T15:39:46Z) - SConU: Selective Conformal Uncertainty in Large Language Models [59.25881667640868]
We propose a novel approach termed Selective Conformal Uncertainty (SConU)<n>We develop two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level.<n>Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions.
arXiv Detail & Related papers (2025-04-19T03:01:45Z) - ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation [91.20492150248106]
We investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases.<n>We propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs.<n> Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory.
arXiv Detail & Related papers (2025-02-21T15:50:41Z) - MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels [16.300463494913593]
Large Language Models (LLMs) require robust confidence estimation.<n>McQCA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
arXiv Detail & Related papers (2025-02-20T05:09:29Z) - To Trust or Not to Trust? Enhancing Large Language Models' Situated Faithfulness to External Contexts [10.748768620243982]
Large Language Models (LLMs) are often augmented with external contexts, such as those used in retrieval-augmented generation (RAG)<n>We show that when provided with both correct and incorrect contexts, both open-source and proprietary models tend to overly rely on external information.<n>We propose two approaches: Self-Guided Confidence Reasoning (SCR) and Rule-Based Confidence Reasoning (RCR)
arXiv Detail & Related papers (2024-10-18T17:59:47Z) - Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models [14.5291643644017]
We introduce the concept of Confidence-Probability Alignment.
We probe the alignment between models' internal and expressed confidence.
Among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment.
arXiv Detail & Related papers (2024-05-25T15:42:04Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - A Survey on Uncertainty Toolkits for Deep Learning [3.113304966059062]
We present the first survey on toolkits for uncertainty estimation in deep learning (DL)
We investigate 11 toolkits with respect to modeling and evaluation capabilities.
While the first two provide a large degree of flexibility and seamless integration into their respective framework, the last one has the larger methodological scope.
arXiv Detail & Related papers (2022-05-02T17:23:06Z) - An evaluation of word-level confidence estimation for end-to-end
automatic speech recognition [70.61280174637913]
We investigate confidence estimation for end-to-end automatic speech recognition (ASR)
We provide an extensive benchmark of popular confidence methods on four well-known speech datasets.
Our results suggest a strong baseline can be obtained by scaling the logits by a learnt temperature.
arXiv Detail & Related papers (2021-01-14T09:51:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.