Related papers: Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios

URL: http://arxiv.org/abs/2511.00620v1
Date: Sat, 01 Nov 2025 16:51:11 GMT
Title: Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios
Authors: Autumn Toney-Wails, Ryan Wails
Abstract summary: We investigate the relationship between token certainty and alignment with theoretical probability distributions in well-defined probabilistic scenarios. We measure two dimensions: (1) response validity with respect to scenario constraints, and (2) alignment between token-level output probabilities and theoretical probabilities. Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions.
Score: 1.1510009152620668
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliable uncertainty quantification (UQ) is essential for ensuring trustworthy downstream use of large language models, especially when they are deployed in decision-support and other knowledge-intensive applications. Model certainty can be estimated from token logits, with derived probability and entropy values offering insight into performance on the prompt task. However, this approach may be inadequate for probabilistic scenarios, where the probabilities of token outputs are expected to align with the theoretical probabilities of the possible outcomes. We investigate the relationship between token certainty and alignment with theoretical probability distributions in well-defined probabilistic scenarios. Using GPT-4.1 and DeepSeek-Chat, we evaluate model responses to ten prompts involving probability (e.g., roll a six-sided die), both with and without explicit probability cues in the prompt (e.g., roll a fair six-sided die). We measure two dimensions: (1) response validity with respect to scenario constraints, and (2) alignment between token-level output probabilities and theoretical probabilities. Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions.
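The comparison described in the abstract, between token-level output probabilities and the theoretical distribution of a scenario, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the token probabilities are invented, the helper names are hypothetical, and total variation distance is used as one possible alignment measure, not necessarily the one the authors use.

```python
import math

def normalize(probs):
    """Rescale a {token: probability} mapping so its values sum to 1."""
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}

def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical)."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)

# Theoretical distribution for "roll a fair six-sided die".
theoretical = {str(face): 1 / 6 for face in range(1, 7)}

# Hypothetical token probabilities for the first answer token (made-up values,
# e.g. obtained by exponentiating the log-probabilities an API returns).
model = normalize({"1": 0.02, "2": 0.05, "3": 0.30, "4": 0.50, "5": 0.10, "6": 0.03})

# Every token is a valid die face, yet the distribution is far from uniform.
all_valid = set(model) <= set(theoretical)
print("all tokens valid:", all_valid)
print("model entropy (bits):", round(entropy_bits(model), 3))
print("theoretical entropy (bits):", round(entropy_bits(theoretical), 3))
print("total variation distance:", round(total_variation(model, theoretical), 3))
```

In this toy case every candidate token is a valid outcome, which corresponds to the validity dimension, while the lower entropy and nonzero distance show how a model can respond validly without matching the theoretical probabilities.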

Related papers

Trajectory of Probabilities, Probability on Trajectories, and the Stochastic-Quantum Correspondence [0.0]
A lack of a clear distinction between these two probabilistic descriptions has given rise to conceptual difficulties. We define probability dynamics and process families together with a precise notion of implementation that connects the two descriptions. We show that implementations are generically non-unique, that every probability dynamics admits a Markovian implementation, and characterize when non-Markovian implementations are possible.
arXiv Detail & Related papers (2026-02-26T20:53:16Z)
Probabilities Are All You Need: A Probability-Only Approach to Uncertainty Estimation in Large Language Models [13.41454380481593]
Uncertainty estimation, often using predictive entropy estimation, is key to addressing this issue. This paper proposes an efficient, training-free uncertainty estimation method that approximates predictive entropy using the responses' top-$K$ probabilities.
arXiv Detail & Related papers (2025-11-10T23:31:43Z)
Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs [47.20307724127832]
We present the first comprehensive study of the reasoning capabilities of large language models (LLMs). We evaluate models on three carefully designed tasks: mode identification, maximum likelihood estimation, and sample generation. Through empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models.
arXiv Detail & Related papers (2025-09-12T22:58:05Z)
Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models [5.6672926445919165]
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods often lack a probabilistic foundation. We propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations.
arXiv Detail & Related papers (2025-06-11T13:02:17Z)
Recovering Event Probabilities from Large Language Model Embeddings via Axiomatic Constraints [4.029252551781513]
We propose enforcing axiomatic constraints, such as the additive rule of probability theory, in the latent space learned by an extended variational autoencoder. This approach enables event probabilities to naturally emerge in the latent space as the VAE learns to both reconstruct the original embeddings and predict the embeddings of semantically related events.
arXiv Detail & Related papers (2025-05-10T19:04:56Z)
BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models [52.46248487458641]
Predictive models often need to work with incomplete information in real-world tasks. Current large language models (LLMs) are insufficient for accurate estimations. We propose BIRD, a novel probabilistic inference framework.
arXiv Detail & Related papers (2024-04-18T20:17:23Z)
User-defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems [49.75149094527068]
We show that diffusion models can be adapted to make predictions and provide uncertainty quantification for chaotic dynamical systems. We develop a probabilistic approximation scheme for the conditional score function which converges to the true distribution as the noise level decreases. We are able to sample conditionally on nonlinear user-defined events at inference time, and match data statistics even when sampling from the tails of the distribution.
arXiv Detail & Related papers (2023-06-13T03:42:03Z)
Reconciling Individual Probability Forecasts [78.0074061846588]
We show that two parties who agree on the data cannot disagree on how to model individual probabilities. We conclude that although individual probabilities are unknowable, they are contestable via a computationally and data efficient process.
arXiv Detail & Related papers (2022-09-04T20:20:35Z)
Logical Credal Networks [87.25387518070411]
This paper introduces Logical Credal Networks, an expressive probabilistic logic that generalizes many prior models that combine logic and probability. We investigate its performance on maximum a posteriori inference tasks, including solving Mastermind games with uncertainty and detecting credit card fraud.
arXiv Detail & Related papers (2021-09-25T00:00:47Z)
Multivariate Probabilistic Regression with Natural Gradient Boosting [63.58097881421937]
We propose a Natural Gradient Boosting (NGBoost) approach based on nonparametrically modeling the conditional parameters of the multivariate predictive distribution. Our method is robust, works out-of-the-box without extensive tuning, is modular with respect to the assumed target distribution, and performs competitively in comparison to existing approaches.
arXiv Detail & Related papers (2021-06-07T17:44:49Z)
Handling Epistemic and Aleatory Uncertainties in Probabilistic Circuits [18.740781076082044]
We propose an approach to overcome the independence assumption behind most of the approaches dealing with a large class of probabilistic reasoning. We provide an algorithm for Bayesian learning from sparse, albeit complete, observations. Each leaf of such circuits is labelled with a beta-distributed random variable that provides us with an elegant framework for representing uncertain probabilities.
arXiv Detail & Related papers (2021-02-22T10:03:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.