Related papers: Failure to Mix: Large language models struggle to answer according to desired probability distributions

URL: http://arxiv.org/abs/2511.14630v1
Date: Tue, 18 Nov 2025 16:22:26 GMT
Title: Failure to Mix: Large language models struggle to answer according to desired probability distributions
Authors: Ivy Yuqian Yang, David Yu Zhang
Abstract summary: Current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scientific idea generation and selection requires exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of "1" 49% of the time produces an answer of "0" nearly 100% of the time. This step-function-like behavior of near-exclusively generating the output with marginally highest probability overrules even strong in-built LLM biases.
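The 49% example in the abstract can be illustrated with a small simulation. This is only a sketch and not the authors' experimental setup: `ideal_sampler` and `step_function_responder` are hypothetical names, and the second function is an assumed stand-in for the reported behavior (always emitting the marginally more likely output) rather than a call to any real LLM.

```python
import random


def ideal_sampler(p_one, rng):
    """Return "1" with probability p_one, otherwise "0"."""
    return "1" if rng.random() < p_one else "0"


def step_function_responder(p_one):
    """Hypothetical stand-in for the behavior described in the abstract:
    always emit whichever output has the marginally higher probability."""
    return "1" if p_one > 0.5 else "0"


def frequency_of_one(outputs):
    """Fraction of outputs equal to "1"."""
    return sum(o == "1" for o in outputs) / len(outputs)


rng = random.Random(0)   # fixed seed so the run is repeatable
n_trials = 10_000
p_requested = 0.49       # the request from the abstract: "1" 49% of the time

ideal_freq = frequency_of_one(
    [ideal_sampler(p_requested, rng) for _ in range(n_trials)]
)
step_freq = frequency_of_one(
    [step_function_responder(p_requested) for _ in range(n_trials)]
)

print(f"requested P(1) = {p_requested}")
print(f"ideal sampler frequency of '1': {ideal_freq:.3f}")
print(f"step-function frequency of '1': {step_freq:.3f}")
```

The ideal sampler's frequency of "1" lands close to the requested 0.49, while the step-function stand-in never emits "1", which is the gap the abstract describes.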

Related papers

Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions [50.1404916337174]
We present the first large-scale, statistically powered audit of native probabilistic sampling in large language models (LLMs). We show that batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely. We conclude that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.
arXiv Detail & Related papers (2026-01-08T22:33:12Z)
Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling [59.133428586090226]
Large language models (LLMs) can often accurately describe probability distributions using natural language. This mismatch limits their use in tasks requiring reliable stochasticity, such as Monte Carlo methods, agent-based simulations, and randomized decision-making. We introduce Verbalized Rejection Sampling (VRS), a natural-language adaptation of classical rejection sampling.
arXiv Detail & Related papers (2025-06-11T17:59:58Z)
Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce [16.149411731945968]
We show that some distributions are significantly harder to elicit than others. We find that distributions with very low or very high entropy are easier to approximate than those with moderate entropy.
arXiv Detail & Related papers (2025-05-18T05:49:48Z)
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [75.1351701045874]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs). We propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration [39.35476224845088]
Large language models (LLMs) exhibit complementary strengths in various tasks, motivating the research of LLM ensembling. We propose a training-free ensemble framework DeePEn, fusing the informative probability distributions yielded by different LLMs at each decoding step.
arXiv Detail & Related papers (2024-04-19T08:52:22Z)
Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation [73.58618024960968]
An increasing number of studies are employing large language models (LLMs) as agents to emulate the sequential decision-making processes of humans. This arouses curiosity regarding the capacity of LLM agents to comprehend probability distributions. Our analysis indicates that LLM agents can understand probabilities, but they struggle with probability sampling.
arXiv Detail & Related papers (2024-04-13T16:59:28Z)
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve [21.55766758950951]
We make predictions about the strategies that large language models will adopt to solve next-word prediction tasks. We evaluate two LLMs on eleven tasks and find robust evidence that LLMs are influenced by probability. We conclude that we should not evaluate LLMs as if they are humans but should instead treat them as a distinct type of system.
arXiv Detail & Related papers (2023-09-24T13:35:28Z)
Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy-tail of rare events accounts for a significant amount of the total probability mass of distributions in language. Standard language modeling metrics such as perplexity quantify the performance of language models (LM) in aggregate. We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
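The "Flipping Against All Odds" entry above describes Verbalized Rejection Sampling as a natural-language adaptation of classical rejection sampling. For orientation, the classical algorithm alone can be sketched as follows. This is not the VRS prompt or the paper's implementation: the function name, the example distributions, and the use of a heavily biased proposal as a stand-in for a biased source of coin flips are all assumptions made for illustration.

```python
import random


def rejection_sample(target, proposal, rng):
    """Classical rejection sampling over a finite set of outcomes.

    target and proposal map each outcome to its probability. The proposal
    must give nonzero probability to every outcome the target can produce.
    """
    # Envelope constant M: the smallest value with target[x] <= M * proposal[x].
    m = max(target[x] / proposal[x] for x in target)
    outcomes = list(proposal)
    weights = [proposal[x] for x in outcomes]
    while True:
        # Draw a candidate from the (biased) proposal distribution.
        x = rng.choices(outcomes, weights=weights)[0]
        # Accept with probability target[x] / (M * proposal[x]); otherwise retry.
        if rng.random() < target[x] / (m * proposal[x]):
            return x


rng = random.Random(0)                # fixed seed so the run is repeatable
target = {"0": 0.51, "1": 0.49}       # desired distribution
proposal = {"0": 0.9, "1": 0.1}       # assumed biased source of samples

n_samples = 10_000
samples = [rejection_sample(target, proposal, rng) for _ in range(n_samples)]
freq_one = samples.count("1") / n_samples
print(f"frequency of '1' after rejection sampling: {freq_one:.3f}")
```

Accepted samples follow the target distribution even though the proposal strongly favors "0"; the cost is that many candidates are discarded when the proposal is far from the target.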

This list is automatically generated from the titles and abstracts of the papers on this site.