On the Entropy Calibration of Language Models
- URL: http://arxiv.org/abs/2511.11966v1
- Date: Sat, 15 Nov 2025 00:33:03 GMT
- Title: On the Entropy Calibration of Language Models
- Authors: Steven Cao, Gregory Valiant, Percy Liang
- Abstract summary: We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. We find that the observed scaling behavior is similar to what is predicted by a simplified theoretical setting. We prove that it is possible to reduce entropy while preserving log loss, if we assume access to a black box which can fit models to predict the future entropy of text.
- Score: 52.47557449370603
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution -- in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
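As a concrete illustration of the quantity being measured, here is a minimal sketch (not from the paper; `next_token_probs` and the toy vocabulary are hypothetical stand-ins for a real LM) that compares a model's per-step entropy over its own generations with its per-step log loss on human text. The difference between the two is the calibration gap the abstract describes.

```python
# Minimal sketch of the entropy-calibration gap: compare (a) the model's
# entropy per step over its own generations with (b) its log loss
# (cross-entropy) per step on human-written text.
# `next_token_probs` is a hypothetical stand-in for a real LM's next-token
# distribution; any autoregressive model exposing per-step probabilities works.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size, purely illustrative

def next_token_probs(context):
    """Hypothetical LM: return a next-token distribution given a context."""
    logits = rng.standard_normal(VOCAB)  # stand-in for real model logits
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generation_entropy_per_step(prompt, num_steps):
    """Average entropy H(p_t) of the model's own sampling distribution."""
    context, entropies = list(prompt), []
    for _ in range(num_steps):
        p = next_token_probs(context)
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
        context.append(rng.choice(VOCAB, p=p))  # sample the next token
    return float(np.mean(entropies))

def log_loss_per_step(human_tokens):
    """Average cross-entropy -log p(x_t | x_<t) on human text."""
    losses = []
    for t in range(1, len(human_tokens)):
        p = next_token_probs(human_tokens[:t])
        losses.append(-np.log(p[human_tokens[t]] + 1e-12))
    return float(np.mean(losses))

human_text = rng.integers(0, VOCAB, size=64).tolist()  # stand-in for a human corpus
gap = generation_entropy_per_step(human_text[:8], 56) - log_loss_per_step(human_text)
print(f"entropy-calibration gap (entropy - log loss): {gap:+.3f} nats/step")
```

Under this framing, truncation methods such as top-p shrink the entropy term but, as the abstract notes, only at the cost of increased log loss on the human text.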
Related papers
- What Scales in Cross-Entropy Scaling Law? [28.394154336032756]
We introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Our findings establish the error-entropy scaling law as a more accurate description of model behavior.
arXiv Detail & Related papers (2025-10-05T07:06:02Z) - Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z) - Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time [73.22651918134808]
This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in language models (LMs). We pretrain LMs from scratch on a synthetic implicit multi-hop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning and can be viewed as a simplification of implicit reasoning during real-world pretraining.
arXiv Detail & Related papers (2025-04-04T17:57:22Z) - Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite-dimensional linear regression setup. We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size, $N$ is the data size, and $a$ is the power-law exponent of the data spectrum (see the numeric sketch after this list). Our theory is consistent with empirical neural scaling laws and is verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Towards Neural Scaling Laws on Graphs [54.435688297561015]
We investigate how the performance of deep graph models changes with model and dataset sizes. For model scaling, we identify that, in addition to the number of parameters, model depth also plays an important role in model scaling behavior. We reformulate the data scaling law with the number of nodes or edges as the metric, to address irregular graph sizes.
arXiv Detail & Related papers (2024-02-03T06:17:21Z) - Inverse scaling can become U-shaped [126.64521446943155]
Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks.
This paper takes a closer look at these inverse scaling tasks.
We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize.
arXiv Detail & Related papers (2022-11-03T17:26:44Z) - Scaling Laws for Neural Machine Translation [21.76567580425173]
We show that cross-entropy loss as a function of model size follows a certain scaling law.
We also investigate the relationship between the cross-entropy loss and the quality of the translations generated.
arXiv Detail & Related papers (2021-09-16T06:15:20Z) - Scaling Laws for Autoregressive Generative Modeling [30.051804305320424]
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving.
In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law.
arXiv Detail & Related papers (2020-10-28T02:17:24Z)
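As a brief numeric aside to the linear-regression scaling law quoted in the list above (a sketch that simply reads the exponents off the stated formula, with the values of `a` chosen purely for illustration), the snippet below shows why a power-law exponent `a` close to 1 drives both scaling exponents toward 0, mirroring the slow improvement of miscalibration reported in the main abstract.

```python
# Exponents of the reducible test error Theta(M^{-(a-1)} + N^{-(a-1)/a})
# from "Scaling Laws in Linear Regression": M = model size, N = data size,
# a = power-law exponent of the data spectrum. As a -> 1, both exponents
# approach 0, i.e. the error decays very slowly with scale.
for a in (1.05, 1.2, 1.5, 2.0, 3.0):  # illustrative values only
    model_exponent = a - 1        # decay rate in model size M
    data_exponent = (a - 1) / a   # decay rate in data size N
    print(f"a = {a:4.2f}  ->  M-exponent = {model_exponent:.3f}, N-exponent = {data_exponent:.3f}")
```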