MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels
- URL: http://arxiv.org/abs/2502.14268v1
- Date: Thu, 20 Feb 2025 05:09:29 GMT
- Title: MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels
- Authors: Xiaoou Liu, Zhen Lin, Longchao Da, Chacha Chen, Shubhendu Trivedi, Hua Wei,
- Abstract summary: Large Language Models (LLMs) require robust confidence estimation.
McQCA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
- Score: 16.300463494913593
- License:
- Abstract: Large Language Models (LLMs) require robust confidence estimation, particularly in critical domains like healthcare and law where unreliable outputs can lead to significant consequences. Despite much recent work in confidence estimation, current evaluation frameworks rely on correctness functions -- various heuristics that are often noisy, expensive, and possibly introduce systematic biases. These methodological weaknesses tend to distort evaluation metrics and thus the comparative ranking of confidence measures. We introduce MCQA-Eval, an evaluation framework for assessing confidence measures in Natural Language Generation (NLG) that eliminates dependence on an explicit correctness function by leveraging gold-standard correctness labels from multiple-choice datasets. MCQA-Eval enables systematic comparison of both internal state-based white-box (e.g. logit-based) and consistency-based black-box confidence measures, providing a unified evaluation methodology across different approaches. Through extensive experiments on multiple LLMs and widely used QA datasets, we report that MCQA-Eval provides efficient and more reliable assessments of confidence estimation methods than existing approaches.
Related papers
- On Verbalized Confidence Scores for LLMs [25.160810008907397]
Uncertainty quantification for large language models (LLMs) can establish more human trust into their responses.
This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens.
We assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods.
arXiv Detail & Related papers (2024-12-19T11:10:36Z) - Label-Confidence-Aware Uncertainty Estimation in Natural Language Generation [8.635811152610604]
Uncertainty Quantification (UQ) is crucial for ensuring the safety and robustness of AI systems.
We propose a label-confidence-aware (LCA) uncertainty estimation based on Kullback-Leibler divergence bridging between samples and label source.
arXiv Detail & Related papers (2024-12-10T07:35:23Z) - Black-box Uncertainty Quantification Method for LLM-as-a-Judge [13.45579129351493]
We introduce a novel method for quantifying uncertainty designed to enhance the trustworthiness of LLM-as-a-Judge evaluations.
The method quantifies uncertainty by analyzing the relationships between generated assessments and possible ratings.
By cross-evaluating these relationships and constructing a confusion matrix based on token probabilities, the method derives labels of high or low uncertainty.
arXiv Detail & Related papers (2024-10-15T13:29:22Z) - Confidence Estimation for LLM-Based Dialogue State Tracking [9.305763502526833]
Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs)
We provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs.
Our findings suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
arXiv Detail & Related papers (2024-09-15T06:44:26Z) - How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency [60.25969380388974]
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs)
Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance.
We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
arXiv Detail & Related papers (2024-07-18T15:20:18Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models [36.273451767886726]
FreeEval is a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of large language models.
FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies.
The framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with dynamic evaluation modules, enhance the fairness of the evaluation outcomes.
arXiv Detail & Related papers (2024-04-09T04:17:51Z) - Revisiting Confidence Estimation: Towards Reliable Failure Prediction [53.79160907725975]
We find a general, widely existing but actually-neglected phenomenon that most confidence estimation methods are harmful for detecting misclassification errors.
We propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance.
arXiv Detail & Related papers (2024-03-05T11:44:14Z) - TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLMs response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z) - Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z) - An evaluation of word-level confidence estimation for end-to-end
automatic speech recognition [70.61280174637913]
We investigate confidence estimation for end-to-end automatic speech recognition (ASR)
We provide an extensive benchmark of popular confidence methods on four well-known speech datasets.
Our results suggest a strong baseline can be obtained by scaling the logits by a learnt temperature.
arXiv Detail & Related papers (2021-01-14T09:51:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.