MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels
- URL: http://arxiv.org/abs/2502.14268v1
- Date: Thu, 20 Feb 2025 05:09:29 GMT
- Title: MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels
- Authors: Xiaoou Liu, Zhen Lin, Longchao Da, Chacha Chen, Shubhendu Trivedi, Hua Wei
- Abstract summary: Large Language Models (LLMs) require robust confidence estimation. MCQA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
- Score: 16.300463494913593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) require robust confidence estimation, particularly in critical domains like healthcare and law where unreliable outputs can lead to significant consequences. Despite much recent work in confidence estimation, current evaluation frameworks rely on correctness functions -- various heuristics that are often noisy, expensive, and possibly introduce systematic biases. These methodological weaknesses tend to distort evaluation metrics and thus the comparative ranking of confidence measures. We introduce MCQA-Eval, an evaluation framework for assessing confidence measures in Natural Language Generation (NLG) that eliminates dependence on an explicit correctness function by leveraging gold-standard correctness labels from multiple-choice datasets. MCQA-Eval enables systematic comparison of both internal state-based white-box (e.g. logit-based) and consistency-based black-box confidence measures, providing a unified evaluation methodology across different approaches. Through extensive experiments on multiple LLMs and widely used QA datasets, we report that MCQA-Eval provides efficient and more reliable assessments of confidence estimation methods than existing approaches.
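The idea of scoring a confidence measure against gold multiple-choice labels can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not spell out the procedure, so the sketch assumes each answer option receives a confidence score, that the keyed option is labeled correct (1) and the others incorrect (0), and that AUROC is the metric of interest. The `questions` data, the `flatten` helper, and the `auroc` function are all hypothetical names introduced for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def auroc(confidences: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a randomly chosen correct item gets a higher
    confidence than a randomly chosen incorrect one (ties count 0.5)."""
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def flatten(questions: List[Dict]) -> Tuple[List[float], List[int]]:
    """Turn per-question option confidences into flat (confidence, label)
    lists, using the answer key as the gold correctness label."""
    confs: List[float] = []
    labels: List[int] = []
    for q in questions:
        for i, c in enumerate(q["confidences"]):
            confs.append(c)
            labels.append(1 if i == q["answer_index"] else 0)
    return confs, labels


# Hypothetical toy data: the confidence a measure assigns to each option,
# plus the index of the keyed (gold) answer. Values are made up.
questions = [
    {"confidences": [0.9, 0.2, 0.4, 0.1], "answer_index": 0},
    {"confidences": [0.3, 0.6, 0.7, 0.2], "answer_index": 1},
]

confs, labels = flatten(questions)
score = auroc(confs, labels)
print(f"AUROC against gold labels: {score:.3f}")
```

Because the labels come straight from the answer key, no correctness function (string matching, an LLM judge, and so on) is needed in this setup; different confidence measures can be compared by swapping in their scores for `confidences`.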
Related papers
- Calibration Is Not Enough: Evaluating Confidence Estimation Under Language Variations [49.84786015324238]
Confidence estimation (CE) indicates how reliable the answers of large language models (LLMs) are, and can impact user trust and decision-making. We present a comprehensive evaluation framework for CE that measures their confidence quality on three new aspects. These include robustness of confidence against prompt perturbations, stability across semantic equivalent answers, and sensitivity to semantically different answers.
arXiv Detail & Related papers (2026-01-12T23:16:50Z) - Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models [0.0]
Large Language Models (LLMs) are increasingly used in decision-critical domains such as healthcare, law, and finance. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric.
arXiv Detail & Related papers (2025-12-30T08:07:28Z) - Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models [1.8374839804848957]
We evaluate four approaches for confidence estimation in large language models (LLMs). We conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall.
arXiv Detail & Related papers (2025-10-23T11:50:47Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - CCE: Confidence-Consistency Evaluation for Time Series Anomaly Detection [56.302586730134806]
We introduce Confidence-Consistency Evaluation (CCE), a novel evaluation metric. CCE simultaneously measures prediction confidence and uncertainty consistency. We also establish RankEval, a benchmark for comparing the ranking capabilities of various metrics.
arXiv Detail & Related papers (2025-09-01T03:38:38Z) - Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z) - Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution [20.607071807794195]
Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence. We advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems.
arXiv Detail & Related papers (2025-08-08T11:11:22Z) - LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges [15.47711837051754]
We evaluate 14 state-of-the-art MLLMs, including both proprietary and open-source models. We propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MLLM-based process judges (MPJs).
arXiv Detail & Related papers (2025-08-06T16:00:19Z) - A Context-Aware Dual-Metric Framework for Confidence Estimation in Large Language Models [6.62851757612838]
Current confidence estimation methods for large language models (LLMs) neglect the relevance between responses and contextual information. We propose CRUX, which integrates context faithfulness and consistency for confidence estimation via two novel metrics. Experiments across three benchmark datasets demonstrate CRUX's effectiveness, achieving higher AUROC than existing baselines.
arXiv Detail & Related papers (2025-08-01T12:58:34Z) - Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness. Observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset. We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z) - Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z) - On Verbalized Confidence Scores for LLMs [25.160810008907397]
Uncertainty quantification for large language models (LLMs) can establish more human trust in their responses. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens. We assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods.
arXiv Detail & Related papers (2024-12-19T11:10:36Z) - Label-Confidence-Aware Uncertainty Estimation in Natural Language Generation [8.635811152610604]
Uncertainty Quantification (UQ) is crucial for ensuring the safety and robustness of AI systems. We propose a label-confidence-aware (LCA) uncertainty estimation based on Kullback-Leibler divergence bridging between samples and label source.
arXiv Detail & Related papers (2024-12-10T07:35:23Z) - Black-box Uncertainty Quantification Method for LLM-as-a-Judge [13.45579129351493]
We introduce a novel method for quantifying uncertainty designed to enhance the trustworthiness of LLM-as-a-Judge evaluations.
The method quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings.
By cross-evaluating these relationships and constructing a confusion matrix based on token probabilities, the method derives labels of high or low uncertainty.
arXiv Detail & Related papers (2024-10-15T13:29:22Z) - Confidence Estimation for LLM-Based Dialogue State Tracking [9.305763502526833]
Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs).
We provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs.
Our findings suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
arXiv Detail & Related papers (2024-09-15T06:44:26Z) - How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency [60.25969380388974]
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs). Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
arXiv Detail & Related papers (2024-07-18T15:20:18Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models [36.273451767886726]
FreeEval is a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of large language models.
FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies.
The framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules, enhance the fairness of the evaluation outcomes.
arXiv Detail & Related papers (2024-04-09T04:17:51Z) - Revisiting Confidence Estimation: Towards Reliable Failure Prediction [53.79160907725975]
We find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors.
We propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance.
arXiv Detail & Related papers (2024-03-05T11:44:14Z) - TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z) - Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z) - An evaluation of word-level confidence estimation for end-to-end automatic speech recognition [70.61280174637913]
We investigate confidence estimation for end-to-end automatic speech recognition (ASR).
We provide an extensive benchmark of popular confidence methods on four well-known speech datasets.
Our results suggest a strong baseline can be obtained by scaling the logits by a learnt temperature.
arXiv Detail & Related papers (2021-01-14T09:51:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.