Related papers: Measuring Language Model Hallucinations Through Distributional Correctness

URL: http://arxiv.org/abs/2510.04302v1
Date: Sun, 05 Oct 2025 17:50:42 GMT
Title: Measuring Language Model Hallucinations Through Distributional Correctness
Authors: Thomas F Burns
Abstract summary: A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to solve this problem. DCS distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing.
Score: 7.106986689736826
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Common evaluation paradigms for language models focus on scoring single responses through accuracy metrics or proper scoring rules, failing to capture the full richness of a model's belief state. Recent work illustrates that language models hallucinate in part because they are optimised to be good test-takers under binary scoring schemes that reward any answer over abstention. While this insight naturally leads to penalty-based approaches, they ignore crucial distinctions in how models distribute uncertainty, for example between hedging toward incorrect answers versus hedging toward "I don't know" responses. A novel evaluation metric, the Distributional Correctness Score (DCS), is introduced to solve this problem, i.e., of not considering a model's entire probability distribution over answer choices. DCS naturally distinguishes between harmful overconfidence in wrong answers and uncertainty expressed through abstention, providing scores in an interpretable default range. Through theoretical analysis and illustrative examples, DCS is demonstrated to offer a more nuanced and aligned evaluation paradigm that incentivises models to express genuine uncertainty rather than guessing. Adapting 12 existing evaluation benchmarks to DCS's variants and measuring performance on six language models reveals that for half of the tested benchmarks scores are negative across all tested models, indicating significant tendencies towards hallucination.
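
The abstract's point about binary scoring can be illustrated with a small expected-value calculation. The sketch below is not the DCS formula (the abstract does not state it); it only shows why a scheme that gives 1 for a correct answer and 0 for everything else rewards guessing over abstention, and how a penalty for wrong answers changes that. The function names, the abstention score of 0, and the `wrong_penalty` parameter are assumptions made for illustration.

```python
# Illustrative sketch only: expected score of answering vs. abstaining.
# Assumptions (not from the paper): a correct answer scores 1, a wrong answer
# scores -wrong_penalty, and abstaining ("I don't know") scores 0.

ABSTAIN_SCORE = 0.0


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of committing to an answer that is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """True if answering has a strictly higher expected score than abstaining."""
    return expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which answering and abstaining have equal expected score."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    for penalty in (0.0, 1.0, 3.0):
        print(
            f"penalty={penalty}: answer at 10% confidence? "
            f"{should_answer(0.1, penalty)}; "
            f"break-even confidence={break_even_confidence(penalty):.2f}"
        )
```

With no penalty, answering beats abstaining at any nonzero confidence, which is the test-taking incentive the abstract describes. A single scalar penalty still scores only the chosen response, so it cannot separate probability mass placed on wrong answers from mass placed on "I don't know"; that is the gap the abstract says DCS addresses.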

Related papers

On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z)
The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity [48.899855816199484]
We introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions. We show that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity.
arXiv Detail & Related papers (2025-11-06T14:46:35Z)
Efficient semantic uncertainty quantification in language models via diversity-steered sampling [46.23327887393273]
We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation.
arXiv Detail & Related papers (2025-10-24T10:06:21Z)
Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models. We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
arXiv Detail & Related papers (2025-10-14T20:50:30Z)
Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
The NazoNazo Benchmark: A Cost-Effective and Extensible Test of Insight-Based Reasoning in LLMs [3.9977256267361754]
We present Nazonazo, a cost-effective benchmark built from Japanese children's riddles to test insight-based reasoning. No model except for GPT-5 is comparable to human performance, which achieves a 52.9% mean accuracy.
arXiv Detail & Related papers (2025-09-18T07:50:04Z)
Conformal Linguistic Calibration: Trading-off between Factuality and Specificity [41.45862052156885]
We present a framework connecting abstention and linguistic calibration through the lens of linguistic pragmatics. Results demonstrate our method produces calibrated outputs with conformal guarantees on factual accuracy.
arXiv Detail & Related papers (2025-02-26T13:01:49Z)
Semi-supervised Learning For Robust Speech Evaluation [30.593420641501968]
Speech evaluation measures a learner's oral proficiency using automatic models. This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization. An anchor model is trained using pseudo labels to predict the correctness of pronunciation.
arXiv Detail & Related papers (2024-09-23T02:11:24Z)
Covert Bias: The Severity of Social Views' Unalignment in Language Models Towards Implicit and Explicit Opinion [0.40964539027092917]
We evaluate the severity of bias toward a view by using a biased model in edge cases of excessive bias scenarios. Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances. The direct, incautious responses of the unaligned models suggest a need for further refinement of decisiveness.
arXiv Detail & Related papers (2024-08-15T15:23:00Z)
Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context. We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions. We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation. It is crucial to correctly quantify their uncertainty in responding to given inputs. We develop a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs.
arXiv Detail & Related papers (2024-04-04T02:31:05Z)
Augmentation by Counterfactual Explanation -- Fixing an Overconfident Classifier [11.233334009240947]
A highly accurate but overconfident model is ill-suited for deployment in critical applications such as healthcare and autonomous driving. This paper proposes an application of counterfactual explanations in fixing an over-confident classifier.
arXiv Detail & Related papers (2022-10-21T18:53:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.