Can LLMs capture stable human-generated sentence entropy measures?
- URL: http://arxiv.org/abs/2602.04570v1
- Date: Wed, 04 Feb 2026 13:57:23 GMT
- Title: Can LLMs capture stable human-generated sentence entropy measures?
- Authors: Estrella Pivel-Villanueva, Elisabeth Frederike Sterner, Franziska Knolle
- Abstract summary: We implement a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. 90% of sentences converged after 111 responses in German and 81 responses in English. Low-entropy sentences (<1) required as few as 20 responses, while high-entropy sentences (>2.5) required substantially more.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Predicting upcoming words is a core mechanism of language comprehension and may be quantified using Shannon entropy. There is currently no empirical consensus on how many human responses are required to obtain stable and unbiased entropy estimates at the word level. Moreover, large language models (LLMs) are increasingly used as substitutes for human norming data, yet their ability to reproduce stable human entropy remains unclear. Here, we address both issues using two large publicly available cloze datasets in German and English. We implemented a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. Across both languages, more than 97% of sentences reached stable entropy estimates within the available sample sizes. 90% of sentences converged after 111 responses in German and 81 responses in English, while low-entropy sentences (<1) required as few as 20 responses and high-entropy sentences (>2.5) substantially more. These findings provide the first direct empirical validation of common norming practices and demonstrate that convergence depends critically on sentence predictability. We then compared stable human entropy values with entropy estimates derived from several LLMs: GPT2-xl/german-GPT-2, RoBERTa Base/GottBERT, LLaMA 2 7B Chat, and GPT-4o, for which we used both logit-based probability extraction and sampling-based frequency estimation. GPT-4o showed the highest correspondence with human data, although alignment depended strongly on the extraction method and prompt design. Logit-based estimates minimized absolute error, whereas sampling-based estimates better captured the dispersion of human variability. Together, our results establish practical guidelines for human norming and show that while LLMs can approximate human entropy, they are not interchangeable with stable human-derived distributions.
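The bootstrap-based convergence analysis described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the cloze responses, the convergence criterion (mean bootstrap deviation within a tolerance of the full-sample entropy), and all thresholds are placeholder assumptions.

```python
import math
import random
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (in bits) of the word distribution in a set of cloze responses."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def convergence_sample_size(responses, tolerance=0.1, n_boot=200, seed=0):
    """Smallest subsample size at which bootstrapped entropy estimates stay,
    on average, within `tolerance` bits of the full-sample entropy."""
    rng = random.Random(seed)
    full = shannon_entropy(responses)
    for k in range(5, len(responses) + 1):
        # Resample k responses with replacement, n_boot times.
        boots = [shannon_entropy(rng.choices(responses, k=k)) for _ in range(n_boot)]
        mean_err = sum(abs(b - full) for b in boots) / n_boot
        if mean_err <= tolerance:
            return k
    return None  # did not converge within the available sample

# Hypothetical cloze responses for two sentence frames:
low_entropy = ["dog"] * 45 + ["cat"] * 5           # highly predictable completion
high_entropy = [f"w{i % 12}" for i in range(120)]  # many competing completions

print(convergence_sample_size(low_entropy))
print(convergence_sample_size(high_entropy))
```

Consistent with the paper's finding, a distribution dominated by one completion stabilizes with far fewer responses than one spread over many competing words, since subsamples of a peaked distribution rarely miss its mass.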
Related papers
- Semantic Chunking and the Entropy of Natural Language [1.3592625530347717]
The entropy rate of printed English is famously estimated to be about one bit per character. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language.
arXiv Detail & Related papers (2026-02-13T18:58:10Z) - Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal [7.591490481106253]
How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). We present evidence for three hypotheses about the advantage of LM probabilities.
arXiv Detail & Related papers (2026-01-14T21:38:54Z) - Probabilities Are All You Need: A Probability-Only Approach to Uncertainty Estimation in Large Language Models [13.41454380481593]
Uncertainty estimation, often via predictive entropy, is key to assessing the reliability of LLM outputs. This paper proposes an efficient, training-free uncertainty estimation method that approximates predictive entropy using the responses' top-$K$ probabilities.
arXiv Detail & Related papers (2025-11-10T23:31:43Z) - REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy [93.8400683020273]
Decoding methods for large language models (LLMs) usually struggle with the tradeoff between ensuring factuality and maintaining diversity.
We propose REAL sampling, a decoding method that improves factuality and diversity over nucleus sampling.
arXiv Detail & Related papers (2024-06-11T21:44:49Z) - Predict the Next Word: Humans exhibit uncertainty in this task and language models _____ [7.581259361859477]
Language models (LMs) are trained to assign probability to human-generated text.
We exploit this fact and evaluate the LM's ability to reproduce variability that humans exhibit in the 'next word prediction' task.
We assess GPT2, BLOOM and ChatGPT and find that they exhibit fairly low calibration to human uncertainty.
arXiv Detail & Related papers (2024-02-27T14:11:32Z) - Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data [55.84746218227712]
This study aims at assessing the relevance of a signal-processing algorithm, initially developed in the field of language acquisition, for the automatic measurement of speech fluency.
arXiv Detail & Related papers (2023-08-09T07:51:40Z) - Estimating the Entropy of Linguistic Distributions [75.20045001387685]
We study the empirical effectiveness of different entropy estimators for linguistic distributions.
We find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators.
arXiv Detail & Related papers (2022-04-04T13:36:46Z) - On the probability-quality paradox in language generation [76.69397802617064]
We analyze language generation through an information-theoretic lens.
We posit that human-like language should contain an amount of information close to the entropy of the distribution over natural strings.
arXiv Detail & Related papers (2022-03-31T17:43:53Z) - Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy-tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LM) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z) - Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English$\Rightarrow$German and English$\Rightarrow$Chinese datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.