What Scales in Cross-Entropy Scaling Law?
- URL: http://arxiv.org/abs/2510.04067v1
- Date: Sun, 05 Oct 2025 07:06:02 GMT
- Title: What Scales in Cross-Entropy Scaling Law?
- Authors: Junxi Yan, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu
- Abstract summary: We introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Our findings establish the error-entropy scaling law as a more accurate description of model behavior.
- Score: 28.394154336032756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The cross-entropy scaling law has long served as a key tool for guiding the development of large language models. It shows that cross-entropy loss decreases at a predictable power-law rate as model size increases. However, recent evidence indicates that this law breaks down at very large scales: the loss decreases more slowly than expected, which creates significant difficulties for developing large language models. In this paper, we hypothesize that the root cause is that cross-entropy itself does not truly scale; instead, only one of its hidden components does. To investigate this, we introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We show both theoretically and empirically that this decomposition precisely captures the training dynamics and optimization objectives. Through extensive experiments on multiple datasets and 32 models spanning five orders of magnitude in size, we find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Moreover, error-entropy constitutes the dominant share of cross-entropy in small models but diminishes in proportion as models grow larger. This explains why the cross-entropy scaling law appears accurate at small scales but fails at very large ones. Our findings establish the error-entropy scaling law as a more accurate description of model behavior. We believe it will have wide applications in the training, understanding, and future development of large language models.
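As context for the abstract's claim, the sketch below shows the kind of fit involved: the standard power-law-plus-constant ansatz $L(N) = A (N/N_0)^{-\alpha} + C$ fitted to per-component loss measurements across model sizes. This is a hypothetical illustration in Python, not the paper's code; the model sizes, the synthetic "error-entropy" values, and the `scaling_ansatz` helper are all invented for demonstration.

```python
# Hypothetical sketch (not the paper's code): fit the standard scaling-law
# ansatz L(N) = A * (N / N0)^(-alpha) + C to a per-component loss, to test
# whether that component follows a power law. All numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

N0 = 1e9  # reference scale; normalizing keeps the fit numerically stable

def scaling_ansatz(n, a, alpha, c):
    """Power-law-plus-constant form: loss = a * (n / N0)^(-alpha) + c."""
    return a * (n / N0) ** (-alpha) + c

# Model sizes (parameter counts) spanning five orders of magnitude,
# echoing the range reported in the paper's experiments.
sizes = np.logspace(6, 11, num=12)

# Synthetic "error-entropy" values drawn from a clean power law plus small
# noise, standing in for the component the paper reports as scaling robustly.
rng = np.random.default_rng(0)
true_curve = scaling_ansatz(sizes, a=0.9, alpha=0.35, c=0.1)
observed = true_curve * (1 + 0.02 * rng.standard_normal(sizes.shape))

# Recover the power-law parameters from the noisy measurements.
params, _ = curve_fit(scaling_ansatz, sizes, observed, p0=(1.0, 0.3, 0.0))
a_hat, alpha_hat, c_hat = params
print(f"fit: loss ~ {a_hat:.3g} * (N/1e9)^(-{alpha_hat:.3f}) + {c_hat:.3g}")
```

Under the paper's finding, a fit like this would track error-entropy across the full size range, while raw cross-entropy would drift away from any single power law at the largest scales.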
Related papers
- On the Entropy Calibration of Language Models [52.47557449370603]
We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. We find that the observed scaling behavior is similar to what is predicted by the simplified setting. We prove that calibration is possible if we assume access to a black box that can fit models to predict the future entropy of text. (A minimal sketch of the entropy-versus-log-loss comparison appears after this list.)
arXiv Detail & Related papers (2025-11-15T00:33:03Z) - Relative-Based Scaling Law for Neural Language Models [26.899273082543612]
Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. We propose the Relative-Based Scaling Law, which characterizes how relative-based probability (RBP) improves with increasing model size.
arXiv Detail & Related papers (2025-10-23T09:37:00Z) - Superposition Yields Robust Neural Scaling [9.278468089636547]
We study the origin of the neural scaling law -- the finding that loss decreases as a power law with model size. We find that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws.
arXiv Detail & Related papers (2025-05-15T16:18:13Z) - Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z) - Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time [73.22651918134808]
This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in language models (LMs). We pretrain LMs from scratch in a synthetic implicit multi-hop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs. We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining.
arXiv Detail & Related papers (2025-04-04T17:57:22Z) - A Tale of Tails: Model Collapse as a Change of Scaling Laws [11.6055501181235]
We ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus?
We develop a theoretical framework of model collapse through the lens of scaling laws.
We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data.
arXiv Detail & Related papers (2024-02-10T21:06:34Z) - Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Scaling Laws for Autoregressive Generative Modeling [30.051804305320424]
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving.
In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law.
arXiv Detail & Related papers (2020-10-28T02:17:24Z)
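As a companion to the entropy-calibration entry above ("On the Entropy Calibration of Language Models"), here is a minimal, self-contained sketch of the two quantities it compares: the entropy of a model's own predictive distribution versus its log loss on reference tokens. The random logits and targets below are stand-ins, not the cited paper's setup; a real study would take both from a language model evaluated on human text.

```python
# Minimal sketch of the entropy-calibration comparison: given next-token
# logits, compare predictive entropy H(q) with log loss -log q(x_t).
# Random tensors stand in for a real model's outputs and reference tokens.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, steps = 50_000, 128

logits = torch.randn(steps, vocab)          # stand-in model logits
targets = torch.randint(vocab, (steps,))    # stand-in reference tokens

log_probs = F.log_softmax(logits, dim=-1)

# Predictive entropy: H(q) = -sum_x q(x) log q(x), averaged over steps.
entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()

# Log loss on the reference tokens: -log q(x_t), averaged over steps.
log_loss = F.nll_loss(log_probs, targets)

# A calibrated model would have these two numbers agree on average.
print(f"predictive entropy {entropy:.3f} vs log loss {log_loss:.3f}")
```

Entropy calibration asks whether these two printed quantities match on average; the cited paper studies how the gap between them behaves as models scale.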