Universal One-third Time Scaling in Learning Peaked Distributions
- URL: http://arxiv.org/abs/2602.03685v1
- Date: Tue, 03 Feb 2026 16:06:18 GMT
- Title: Universal One-third Time Scaling in Learning Peaked Distributions
- Authors: Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore
- Abstract summary: Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence. We show that this behavior can arise intrinsically from the use of softmax and cross-entropy.
- Score: 48.44706450307606
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debated. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
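To make the claimed mechanism concrete, here is a minimal sketch (Python/NumPy; an assumed setup, not the authors' model): gradient descent on logits through softmax and cross-entropy toward a peaked target, with a log-log fit of the late-time excess loss. It shows how such a decay exponent can be measured; the value $1/3$ is derived in the paper for its specific toy models and need not appear in this simplified setting.

```python
# Minimal sketch (assumed setup, not the paper's model): train logits
# directly with softmax + cross-entropy on a peaked target distribution
# and fit the late-time power-law exponent of the excess loss.
import numpy as np

rng = np.random.default_rng(0)
V, eps = 64, 1e-3
target = np.full(V, eps)                    # peaked next-token-like target
target[0] = 1.0 - eps * (V - 1)
entropy = -(target * np.log(target)).sum()  # cross-entropy floor

z = rng.normal(scale=0.01, size=V)          # logits
lr, steps = 0.5, 50_000
excess = np.empty(steps)
for t in range(steps):
    p = np.exp(z - z.max()); p /= p.sum()              # softmax
    excess[t] = -(target * np.log(p)).sum() - entropy  # CE above its floor
    z -= lr * (p - target)                  # exact cross-entropy gradient

# Log-log fit over the late-time window; a straight line signals a power law.
tt = np.arange(1, steps + 1)
mask = tt > steps // 10
slope, _ = np.polyfit(np.log(tt[mask]), np.log(excess[mask]), 1)
print(f"fitted loss-decay exponent: {slope:.2f}")
```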
Related papers
- Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs [60.68927774057402]
We show, for the first time, that a lower simplicity bias (SB) induces better generalization. Motivated by this insight, we demonstrate that adjusting the training data distribution, by upsampling or augmenting examples learned later in training, similarly reduces SB and leads to improved generalization. Our strategy improves the performance of multiple language models, including Phi2-2.7B, Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base, achieving relative accuracy gains of up to 18% when fine-tuned with AdamW and Muon.
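A hedged sketch of the upsampling idea as summarized above; the signal `first_learned_epoch` and the duplication rule are illustrative assumptions, not the paper's method:

```python
# Hedged sketch: duplicate examples that were learned late in a previous
# run. `first_learned_epoch` is a hypothetical precomputed signal (the
# epoch at which each example was first fit), not the paper's API.
import numpy as np

def upsample_late_learned(n_examples, first_learned_epoch, max_copies=4):
    """Return dataset indices where late-learned examples repeat more often."""
    e = np.asarray(first_learned_epoch, dtype=float)
    scaled = (e - e.min()) / max(e.max() - e.min(), 1e-12)  # 0 = early, 1 = late
    copies = 1 + np.round(scaled * (max_copies - 1)).astype(int)
    return np.repeat(np.arange(n_examples), copies)

idx = upsample_late_learned(10, [0, 9, 3, 7, 1, 8, 2, 6, 4, 5])
print(np.bincount(idx))  # late-learned examples appear up to 4 times
```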
arXiv Detail & Related papers (2026-01-31T07:40:36Z) - The Art of Scaling Reinforcement Learning Compute for LLMs [52.71086085139566]
Reinforcement learning (RL) has become central to training large language models. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours.
arXiv Detail & Related papers (2025-10-15T17:43:03Z) - Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules [9.332823269318842]
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models. We establish a Functional Scaling Law that captures the full loss trajectory under arbitrary learning-rate schedules (LRSs). We derive explicit scaling relations in both data- and compute-limited regimes.
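As a toy illustration of the setting such laws describe (not the paper's Functional Scaling Law itself), the sketch below traces the exact loss trajectory of linear regression with an assumed power-law eigenspectrum under a cosine learning-rate schedule:

```python
# Toy illustration (assumptions: power-law spectrum, exact per-mode GD):
# the loss trajectory of linear regression under a cosine learning-rate
# schedule, the kind of trajectory a functional scaling law aims to predict.
import numpy as np

d, T = 512, 5_000
lam = np.arange(1, d + 1, dtype=float) ** -1.5  # assumed eigenvalue decay
err = np.ones(d)                                # per-mode error coefficients

def lr_schedule(t):
    return 0.5 * (1 + np.cos(np.pi * t / T))    # example LRS: cosine decay

losses = np.empty(T)
for t in range(T):
    losses[t] = 0.5 * np.sum(lam * err**2)      # population loss
    err *= 1 - lr_schedule(t) * lam             # exact GD update per mode

print(losses[::1000])  # trajectory shaped jointly by the spectrum and the LRS
```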
arXiv Detail & Related papers (2025-09-23T16:05:16Z) - J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge [24.607213170485743]
This paper introduces $\textbf{J1-7B}$, which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection sampling. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that $\textbf{J1-7B}$ surpasses the previous state-of-the-art LLM-as-a-Judge by $\textbf{4.8}\%$ and exhibits a $\textbf{5.1}\%$ stronger scaling trend under STTS.
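A generic illustration of test-time scaling for a judge model, assuming a simple sample-and-vote strategy; the `judge` stub is hypothetical, and this is not claimed to be J1's exact STTS procedure:

```python
# Generic test-time scaling sketch (sample-and-vote); `judge` is a
# hypothetical stand-in for a stochastic LLM-as-a-Judge call.
import random
from collections import Counter

def judge(prompt: str, rng: random.Random) -> str:
    # Stand-in: a noisy judge that prefers answer "A" 70% of the time.
    return rng.choices(["A", "B"], weights=[0.7, 0.3])[0]

def scaled_judgment(prompt: str, n_samples: int = 8, seed: int = 0) -> str:
    rng = random.Random(seed)
    votes = Counter(judge(prompt, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]  # more samples -> more stable verdict

print(scaled_judgment("Which answer is better?", n_samples=16))
```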
arXiv Detail & Related papers (2025-05-17T06:58:42Z) - Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more computation-efficient metric for performance estimation. We present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training.
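A hedged sketch of the two-stage idea as summarized (compute to pre-training loss, loss to downstream metric); all numbers and functional forms below are illustrative assumptions, not the paper's fits:

```python
# Hedged two-stage sketch: fit compute -> pre-training loss, then
# loss -> downstream accuracy. Data points are invented for illustration.
import numpy as np

flops = np.array([1e18, 3e18, 1e19, 3e19])   # small pilot runs (assumed)
loss = np.array([3.2, 2.9, 2.6, 2.4])
acc = np.array([0.31, 0.38, 0.47, 0.54])

# Stage 1: power law loss ~ a * FLOPs^b, fit in log-log space.
b, log_a = np.polyfit(np.log(flops), np.log(loss), 1)

def predict_loss(C):
    return np.exp(log_a) * C ** b

# Stage 2: map loss to accuracy (linear here; saturating fits are common).
k, m = np.polyfit(loss, acc, 1)

def predict_acc(C):
    return k * predict_loss(C) + m

print(f"extrapolated accuracy at 1e20 FLOPs: {predict_acc(1e20):.2f}")
```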
arXiv Detail & Related papers (2024-10-11T04:57:48Z) - LEMON: Lossless model expansion [43.40389747029802]
Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance.
We present $\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts.
We show that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
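For intuition, here is a function-preserving width expansion in the spirit of lossless expansion (a Net2Net-style duplicate-and-split rule; not LEMON's actual recipe, which also handles Transformer components such as LayerNorm):

```python
# Sketch of function-preserving width expansion: duplicate hidden units
# and split their outgoing weights so the network's output is unchanged.
import numpy as np

def expand_hidden(W1, b1, W2, new_width):
    """Widen the hidden layer of y = W2 @ relu(W1 @ x + b1) losslessly."""
    old = W1.shape[0]
    src = np.random.default_rng(0).integers(0, old, size=new_width - old)
    W1e = np.vstack([W1, W1[src]])       # duplicate incoming weights
    b1e = np.concatenate([b1, b1[src]])
    W2e = np.hstack([W2, W2[:, src]])    # copy outgoing weights ...
    counts = np.bincount(np.concatenate([np.arange(old), src]), minlength=old)
    W2e[:, :old] /= counts               # ... then split among duplicates
    W2e[:, old:] /= counts[src]
    return W1e, b1e, W2e

rng = np.random.default_rng(1)
W1, b1, W2 = rng.normal(size=(8, 4)), rng.normal(size=8), rng.normal(size=(3, 8))
x = rng.normal(size=4)
W1e, b1e, W2e = expand_hidden(W1, b1, W2, new_width=12)
y_old = W2 @ np.maximum(W1 @ x + b1, 0)
y_new = W2e @ np.maximum(W1e @ x + b1e, 0)
print(np.allclose(y_old, y_new))         # True: expansion is lossless
```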
arXiv Detail & Related papers (2023-10-12T03:02:41Z) - Distributional Reinforcement Learning with Dual Expectile-Quantile Regression [51.87411935256015]
The quantile regression approach to distributional RL provides a flexible and effective way of learning arbitrary return distributions. We show that distributional estimation guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean. Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution in a way that allows efficient learning.
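For reference, the two standard losses being combined: the pinball (quantile) loss and the asymmetric squared (expectile) loss. These definitions are textbook; the paper's joint-learning scheme itself is not reproduced here.

```python
# Standard quantile and expectile losses on a residual u = sample - estimate.
import numpy as np

def quantile_loss(u, tau):
    # Pinball loss: asymmetric absolute error, minimized by the tau-quantile.
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def expectile_loss(u, tau):
    # Asymmetric squared error, minimized by the tau-expectile; its L2
    # character gives the smoother gradients that motivate the paper.
    return np.where(u >= 0, tau * u**2, (1 - tau) * u**2)

u = np.linspace(-2, 2, 5)
print(quantile_loss(u, 0.9))
print(expectile_loss(u, 0.9))
```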
arXiv Detail & Related papers (2023-05-26T12:30:05Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute (a potential speedup of up to $\times 3$) while provably maintaining high performance.
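A hedged sketch of confidence-based early exiting, the mechanism CALM builds on; the per-layer confidence function below is a random stand-in, and CALM's exit classifiers and calibrated thresholds are omitted:

```python
# Sketch of early exiting: stop running layers for a token once an
# (assumed) per-layer confidence clears a threshold. `layer_confidence`
# is a hypothetical stand-in, not CALM's actual measure.
import numpy as np

def layer_confidence(hidden, rng):
    # Stand-in for an intermediate-layer exit classifier's confidence.
    return rng.uniform()

def generate_token(n_layers=24, threshold=0.9, seed=0):
    rng = np.random.default_rng(seed)
    hidden = None
    for layer in range(1, n_layers + 1):
        # ... a real model would run transformer layer `layer` here ...
        if layer_confidence(hidden, rng) >= threshold:
            return layer                 # exit early: fewer layers, less compute
    return n_layers                      # fall through: full-depth decode

used = [generate_token(seed=s) for s in range(100)]
print(f"mean layers used per token: {np.mean(used):.1f} / 24")
```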
arXiv Detail & Related papers (2022-07-14T17:00:19Z)