Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce
- URL: http://arxiv.org/abs/2505.12244v1
- Date: Sun, 18 May 2025 05:49:48 GMT
- Title: Distribution Prompting: Understanding the Expressivity of Language Models Through the Next-Token Distributions They Can Produce
- Authors: Haojin Wang, Zining Zhu, Freda Shi
- Abstract summary: We show that some distributions are significantly harder to elicit than others. We find that distributions with very low or very high entropy are easier to approximate than those with moderate entropy.
- Score: 10.369289331969098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive neural language models (LMs) generate a probability distribution over tokens at each time step given a prompt. In this work, we attempt to systematically understand the probability distributions that LMs can produce, showing that some distributions are significantly harder to elicit than others. Specifically, for any target next-token distribution over the vocabulary, we attempt to find a prompt that induces the LM to output a distribution as close as possible to the target, using either soft or hard gradient-based prompt tuning. We find that (1) in general, distributions with very low or very high entropy are easier to approximate than those with moderate entropy; (2) among distributions with the same entropy, those containing "outlier tokens" are easier to approximate; (3) target distributions generated by LMs -- even LMs with different tokenizers -- are easier to approximate than randomly chosen targets. These results offer insights into the expressiveness of LMs and the challenges of using them as probability distribution proposers.
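To make the setup concrete, here is a minimal sketch of the soft prompt-tuning variant described in the abstract: a small set of trainable continuous prompt embeddings is optimized so that the LM's next-token distribution moves toward a given target. The model choice (GPT-2), prompt length, KL objective, and optimizer settings are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch: soft prompt tuning to elicit a target next-token distribution.
# Model, prompt length, loss, and hyperparameters are assumptions for illustration.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # assumed model
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the soft prompt is trained

vocab_size = model.config.vocab_size
embed_dim = model.config.n_embd
prompt_len = 10  # number of soft-prompt vectors (assumption)

# Target next-token distribution over the vocabulary (here: a random target).
target = torch.rand(vocab_size)
target = target / target.sum()

# Soft prompt: trainable continuous embeddings fed in place of token embeddings.
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(1, prompt_len, embed_dim))
optimizer = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(500):
    optimizer.zero_grad()
    logits = model(inputs_embeds=soft_prompt).logits[0, -1]  # next-token logits after the prompt
    log_probs = F.log_softmax(logits, dim=-1)
    # KL(target || model) as one plausible measure of closeness to the target.
    loss = F.kl_div(log_probs, target, reduction="sum")
    loss.backward()
    optimizer.step()
```

Hard prompt tuning would instead search over discrete vocabulary tokens (e.g., with gradient-guided token swaps) while keeping the same closeness objective.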
Related papers
- Entropy-Aligned Decoding of LMs for Better Writing and Reasoning [21.971790771470324]
Language models (LMs) are trained on billions of tokens in an attempt to recover the true language distribution. Currently, vanilla random sampling from LMs yields low-quality generations. We introduce EPIC, a hyperparameter-free decoding approach that incorporates the entropy of future trajectories into LM decoding.
arXiv Detail & Related papers (2026-01-05T01:37:10Z) - Failure to Mix: Large language models struggle to answer according to desired probability distributions [0.0]
Current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions.
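As a rough illustration of the kind of check this describes, the sketch below asks a model for answers following a requested two-outcome distribution and compares empirical frequencies against the request. `query_model`, the prompt wording, and the trial count are placeholders, not the paper's protocol.

```python
# Hedged sketch: does a model's sampled answer frequency match a requested distribution?
from collections import Counter

def query_model(prompt: str) -> str:
    """Stub for an actual LLM call; always answering "A" mimics a model that fails to mix."""
    return "A"

requested = {"A": 0.7, "B": 0.3}  # illustrative target, not from the paper
prompt = "Answer 'A' with probability 0.7 and 'B' with probability 0.3. Answer:"

n_trials = 1000
counts = Counter(query_model(prompt) for _ in range(n_trials))
empirical = {k: counts.get(k, 0) / n_trials for k in requested}

# Total variation distance between the requested and empirical distributions.
tv = 0.5 * sum(abs(requested[k] - empirical[k]) for k in requested)
print(empirical, f"TV distance = {tv:.3f}")
```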
arXiv Detail & Related papers (2025-11-18T16:22:26Z) - Optimal Inference Schedules for Masked Diffusion Models [16.774584258255768]
Masked diffusion models (MDMs) are able to sample tokens out of order and, ostensibly, many tokens at once in parallel. We show that, in general, such parallel sampling cannot be competitive without strong a priori knowledge of the distribution.
arXiv Detail & Related papers (2025-11-06T18:38:24Z) - Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detecting text generated by large language models (LLMs). We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and token positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. PAWN shows competitive and even better in-distribution performance than the strongest baselines, with a fraction of their trainable parameters.
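For intuition, the sketch below extracts two per-position features from a causal LM's next-token distributions (token log-likelihood and entropy), the kind of metrics a detector such as PAWN could weight and aggregate. It is not the PAWN architecture itself, and the GPT-2 model choice is an assumption.

```python
# Hedged sketch: per-token features derived from next-token distributions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed model
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Next-token distributions carry signal about who wrote a text."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                          # (1, T, vocab)
log_probs = F.log_softmax(logits[:, :-1], dim=-1)       # predictions for tokens 2..T
token_ll = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)  # log-likelihood per token
entropy = -(log_probs.exp() * log_probs).sum(-1)               # entropy per position
# A detector could combine such per-position features with learned, position-aware weights.
```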
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Label Distribution Learning using the Squared Neural Family on the Probability Simplex [15.680835401104247]
We propose SNEFY-LDL, a novel label distribution learning model. It estimates a probability distribution over all possible label distributions on the simplex, and can be used to predict ground-truth label distributions, construct label distribution confidence intervals, and measure the correlations between different labels.
arXiv Detail & Related papers (2024-12-10T09:12:02Z) - OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions [7.611860976107124]
We consider coverless steganography where a Large Language Model drives an arithmetic coding decoder to generate stego-texts.
An efficient method should embed secret message bits in as few language tokens as possible, while still keeping the stego-text natural and fluent.
arXiv Detail & Related papers (2024-10-06T01:30:45Z) - What Are the Odds? Language Models Are Capable of Probabilistic Reasoning [23.487484744911995]
We focus on evaluating the probabilistic reasoning capabilities of language models (LMs) using idealized and real-world statistical distributions.
We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities.
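To illustrate what the three tasks ask for, the snippet below computes reference answers under an idealized normal distribution with SciPy; the distribution and its parameters are made up for illustration, not taken from the paper.

```python
# Hedged sketch: reference answers for the three probabilistic-reasoning tasks.
from scipy import stats

dist = stats.norm(loc=100, scale=15)  # assumed idealized distribution

# 1. Estimating percentiles: what fraction of the mass lies below 120?
percentile = dist.cdf(120)

# 2. Drawing samples: reference draws an LM's samples could be compared against.
samples = dist.rvs(size=1000, random_state=0)

# 3. Calculating probabilities: mass in the interval [90, 110].
prob = dist.cdf(110) - dist.cdf(90)

print(percentile, samples.mean(), prob)
```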
arXiv Detail & Related papers (2024-06-18T17:51:24Z) - Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation [73.58618024960968]
An increasing number of studies are employing large language models (LLMs) as agents to emulate the sequential decision-making processes of humans. This arouses curiosity regarding the capacity of LLM agents to comprehend probability distributions. Our analysis indicates that LLM agents can understand probabilities, but they struggle with probability sampling.
arXiv Detail & Related papers (2024-04-13T16:59:28Z) - How Does Independence Help Generalization? Sample Complexity of ERM on Product Distributions [5.553167334488855]
We show that Empirical Risk Minimization (ERM) needs an exponential number of samples to learn on product distributions.
This leads to the conclusion that a product distribution by itself does not make a learning problem easier; an algorithm designed specifically for product distributions is needed.
arXiv Detail & Related papers (2022-12-13T08:14:32Z) - Score-Based Diffusion meets Annealed Importance Sampling [89.92133671626327]
Annealed Importance Sampling (AIS) remains one of the most effective methods for marginal likelihood estimation.
We leverage recent progress in score-based generative modeling to approximate the optimal extended target distribution for AIS proposals.
arXiv Detail & Related papers (2022-08-16T12:13:29Z) - Personalized Trajectory Prediction via Distribution Discrimination [78.69458579657189]
Trajectory prediction is confronted with the dilemma of capturing the multi-modal nature of future dynamics.
We present a distribution discrimination (DisDis) method to predict personalized motion patterns.
Our method can be integrated with existing multi-modal predictive models as a plug-and-play module.
arXiv Detail & Related papers (2021-07-29T17:42:12Z) - Robust Learning of Optimal Auctions [84.13356290199603]
We study the problem of learning revenue-optimal multi-bidder auctions from samples when the samples of bidders' valuations can be adversarially corrupted or drawn from distributions that are adversarially perturbed.
We propose new algorithms that can learn a mechanism whose revenue is nearly optimal simultaneously for all "true distributions" that are $\alpha$-close to the original distribution in Kolmogorov-Smirnov distance.
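For reference, the Kolmogorov-Smirnov distance used here is simply the largest gap between two CDFs; a minimal illustration with arbitrary example distributions:

```python
# Hedged sketch: Kolmogorov-Smirnov distance between two CDFs on a finite grid.
import numpy as np

def ks_distance(cdf_p, cdf_q, grid):
    """Approximates sup_x |P(X <= x) - Q(X <= x)| on a grid."""
    return float(np.max(np.abs(cdf_p(grid) - cdf_q(grid))))

grid = np.linspace(0.0, 10.0, 10001)
cdf_p = lambda x: np.clip(x / 10.0, 0.0, 1.0)            # Uniform(0, 10)
cdf_q = lambda x: np.clip((x / 10.0) ** 1.1, 0.0, 1.0)   # a perturbed version

print(ks_distance(cdf_p, cdf_q, grid))  # "alpha-close" if this value is at most alpha
```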
arXiv Detail & Related papers (2021-07-13T17:37:21Z) - Distributional Reinforcement Learning via Moment Matching [54.16108052278444]
We formulate a method that learns a finite set of statistics from each return distribution via neural networks.
Our method can be interpreted as implicitly matching all orders of moments between a return distribution and its Bellman target.
Experiments on the suite of Atari games show that our method outperforms the standard distributional RL baselines.
arXiv Detail & Related papers (2020-07-24T05:18:17Z)
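As a rough illustration of sample-based moment matching between a predicted return distribution and its Bellman target, the sketch below uses a Gaussian-kernel MMD over particles; the kernel choice, particle count, and reward/discount values are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: kernel-based moment matching between return particles and a Bellman target.
import torch

def mmd_loss(pred: torch.Tensor, target: torch.Tensor, bandwidth: float = 1.0):
    """Squared MMD between two sets of 1-D particles with a Gaussian kernel."""
    def k(a, b):
        return torch.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(pred, pred).mean() - 2 * k(pred, target).mean() + k(target, target).mean()

pred_particles = torch.randn(32, requires_grad=True)   # predicted return samples
with torch.no_grad():
    target_particles = 0.5 + 0.99 * torch.randn(32)    # reward + discount * next-state returns
loss = mmd_loss(pred_particles, target_particles)
loss.backward()  # gradients pull the predicted particles toward the target distribution
```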