Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules
- URL: http://arxiv.org/abs/2509.19189v3
- Date: Mon, 03 Nov 2025 13:29:04 GMT
- Title: Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules
- Authors: Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu
- Abstract summary: Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models. We establish a Functional Scaling Law that captures the full loss trajectory under arbitrary LRSs. We derive explicit scaling relations in both data- and compute-limited regimes.
- Score: 9.332823269318842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
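The sketch below illustrates the three learning-rate schedules named in the abstract (constant, exponential decay, warmup-stable-decay) and one natural reading of the "intrinsic time" idea, namely accumulated learning rate rather than iteration count. The paper's precise definition of intrinsic time and the FSL convolutional functional are not reproduced here; all function names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def constant_lr(T, peak_lr=1e-3):
    """Constant schedule: eta_t = peak_lr for every step."""
    return np.full(T, peak_lr)

def exponential_decay_lr(T, peak_lr=1e-3, decay_rate=5.0):
    """Exponential decay: eta_t = peak_lr * exp(-decay_rate * t / T)."""
    t = np.arange(T)
    return peak_lr * np.exp(-decay_rate * t / T)

def wsd_lr(T, peak_lr=1e-3, warmup_frac=0.05, decay_frac=0.2, final_lr=1e-5):
    """Warmup-stable-decay: linear warmup, constant plateau, linear decay."""
    warmup, decay = int(warmup_frac * T), int(decay_frac * T)
    stable = T - warmup - decay
    return np.concatenate([
        np.linspace(0.0, peak_lr, warmup, endpoint=False),
        np.full(stable, peak_lr),
        np.linspace(peak_lr, final_lr, decay),
    ])

def intrinsic_time(lrs):
    # One common notion of "intrinsic time": cumulative learning rate, so that
    # progress is measured by total applied step size rather than raw steps
    # (an assumption for illustration, not necessarily the paper's definition).
    return np.cumsum(lrs)

if __name__ == "__main__":
    T = 10_000
    for name, sched in [("constant", constant_lr(T)),
                        ("exp-decay", exponential_decay_lr(T)),
                        ("wsd", wsd_lr(T))]:
        print(f"{name:10s} final intrinsic time = {intrinsic_time(sched)[-1]:.3f}")
```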
Related papers
- Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay [9.371921537573346]
We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. We analyze optimal shape-fixed schedules, where only the peak learning rate is tuned; a minimal tuning sketch is given below.
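A minimal sketch of what shape-fixed tuning can look like: the schedule shape is fixed and normalized to peak 1, and only the peak learning rate is swept. The WSD shape, the toy training objective, and all names here are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def wsd_shape(T, warmup_frac=0.05, decay_frac=0.2):
    """A fixed schedule *shape*, normalized so its peak value is 1."""
    warmup, decay = int(warmup_frac * T), int(decay_frac * T)
    stable = T - warmup - decay
    return np.concatenate([
        np.linspace(0.0, 1.0, warmup, endpoint=False),
        np.ones(stable),
        np.linspace(1.0, 0.01, decay),
    ])

def tune_peak_lr(train_fn, T, peak_grid=(1e-4, 3e-4, 1e-3, 3e-3)):
    """Shape-fixed tuning: sweep only the peak LR, keeping the shape fixed."""
    shape = wsd_shape(T)
    results = {peak: train_fn(peak * shape) for peak in peak_grid}
    best = min(results, key=results.get)   # lowest final loss wins
    return best, results

def toy_train_fn(lrs, dim=50, seed=0):
    """Toy stand-in for training: SGD on a random least-squares problem."""
    rng = np.random.default_rng(seed)
    A, b = rng.normal(size=(200, dim)), rng.normal(size=200)
    x = np.zeros(dim)
    for lr in lrs:
        i = rng.integers(200)
        # gradient of 0.5*(A[i]@x - b[i])**2 wrt x is (A[i]@x - b[i]) * A[i]
        x -= lr * (A[i] @ x - b[i]) * A[i]
    return float(np.mean((A @ x - b) ** 2))

if __name__ == "__main__":
    best_peak, sweep = tune_peak_lr(toy_train_fn, T=2_000)
    print("best peak LR:", best_peak)
```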
arXiv Detail & Related papers (2026-02-06T15:52:30Z)
- Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration [56.074760766965085]
PRISM provides a dynamics-aware framework that arbitrates data based on its degree of cognitive conflict with the model's existing knowledge. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
arXiv Detail & Related papers (2026-01-12T05:43:20Z)
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits; a generic sketch of verifier-guided selection appears below.
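This is not TACO's actual estimator; it is a minimal illustration of the anti-exploration idea: sample several candidate action chunks at test time, score each with a pseudo-count-style familiarity measure (here, a simple kernel-density proxy over previously seen chunks), and execute the most familiar one. The density proxy and all names are assumptions.

```python
import numpy as np

def pseudo_count_score(chunk, reference_chunks, bandwidth=1.0):
    """Familiarity proxy: unnormalized Gaussian kernel density of `chunk`
    under previously seen action chunks (higher = less novel)."""
    d2 = np.sum((reference_chunks - chunk) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2 * bandwidth ** 2))))

def select_action_chunk(candidates, reference_chunks):
    """Anti-exploration selection: keep the candidate the verifier deems
    most in-distribution, discarding novel (potentially unsafe) ones."""
    scores = [pseudo_count_score(c, reference_chunks) for c in candidates]
    return candidates[int(np.argmax(scores))], scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(500, 8))   # previously seen action chunks
    candidates = rng.normal(size=(16, 8))   # test-time samples
    best, scores = select_action_chunk(candidates, reference)
    print("chosen chunk index:", int(np.argmax(scores)))
```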
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
- Learning to Reason as Action Abstractions with Scalable Mid-Training RL [55.24192942739207]
An effective mid-training phase should identify a compact set of useful actions and enable fast selection. We propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm.
arXiv Detail & Related papers (2025-09-30T05:34:20Z)
- Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning [42.80470927369973]
We study how model scale, data volume, and computational budget interact to shape performance. We find that larger models trained for fewer steps consistently outperform smaller models trained for more steps. In data-constrained regimes, repeated reuse of high-quality data proves highly effective.
arXiv Detail & Related papers (2025-09-29T17:10:35Z)
- Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training [16.736880202930482]
We show that Schedule-Free (SF) methods effectively navigate the "river" structure of the loss landscape without decay phases or auxiliary averaging. We propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes; a sketch of the base SF update is given below.
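For reference, a minimal sketch of the base Schedule-Free SGD update (as introduced in the original Schedule-Free work): the fast iterate takes plain SGD steps, the deployed iterate is a running average, and the gradient is evaluated at an interpolation between them. The refined variant proposed in this paper is not reproduced here; hyperparameter values are illustrative.

```python
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=1000):
    """Base Schedule-Free SGD sketch: no learning-rate decay schedule needed."""
    z = np.array(x0, dtype=float)   # fast SGD iterate
    x = z.copy()                    # averaged iterate (what you evaluate/deploy)
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x       # point where the gradient is taken
        z = z - lr * grad_fn(y)             # plain SGD step on z
        c = 1.0 / (t + 1)                   # uniform averaging weight
        x = (1 - c) * x + c * z
    return x

if __name__ == "__main__":
    # toy quadratic: f(w) = 0.5 * ||w - 3||^2
    grad = lambda w: w - 3.0
    w_star = schedule_free_sgd(grad, x0=np.zeros(5))
    print(np.round(w_star, 3))   # should approach [3, 3, 3, 3, 3]
```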
arXiv Detail & Related papers (2025-07-14T00:54:48Z)
- AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining [12.630306478872043]
We propose AdaLRS, a plug-and-play adaptive learning-rate search algorithm that searches for the optimal learning rate online. Experiments show that AdaLRS adjusts suboptimal learning rates to the neighborhood of the optimum with marked efficiency and effectiveness; a generic loss-guided adjustment rule is sketched below.
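This is not the AdaLRS rule itself (that is specified in the paper); it is a minimal illustration of loss-guided online LR adjustment: periodically compare the recent loss-descent rate against the previous window and scale the learning rate up or down accordingly. Window size and scaling factors are arbitrary assumptions.

```python
def loss_guided_lr_adjust(lr, loss_history, window=100, up=1.25, down=0.5):
    """Generic loss-guided LR adjustment (illustrative, not AdaLRS itself).

    Compares the average loss decrease per step over the last `window` steps
    with the window before it, and nudges the LR accordingly."""
    if len(loss_history) < 2 * window:
        return lr                                        # not enough history yet
    recent = loss_history[-window:]
    previous = loss_history[-2 * window:-window]
    recent_slope = (recent[0] - recent[-1]) / window     # > 0 means improving
    previous_slope = (previous[0] - previous[-1]) / window
    if recent_slope > previous_slope:
        return lr * up        # descent is accelerating: try a larger step
    if recent_slope <= 0:
        return lr * down      # loss stalled or rising: back off
    return lr
```

In practice the adjusted value would be fed back into the optimizer every `window` steps.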
arXiv Detail & Related papers (2025-06-16T09:14:01Z)
- Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks [12.108234998867337]
We identify two novel dynamical scaling laws that govern how performance evolves as a function of different norm-based complexity measures. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10, and CIFAR-100. We provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws and explain them through the implicit bias induced by gradient-based training.
arXiv Detail & Related papers (2025-05-19T15:13:36Z)
- Scaling Laws for Data-Efficient Visual Transfer Learning [14.114908296325277]
This paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning. We propose the distillation boundary theory, revealing a critical turning point in distillation efficiency. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation.
arXiv Detail & Related papers (2025-04-17T07:01:01Z)
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z)
- LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws [21.053622641336744]
Loss-to-loss scaling laws relate losses across pretraining datasets and downstream tasks. Our experiments reveal that the pretraining data and tokenizer determine the scaling trend; a sketch of fitting such a law is given below.
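A minimal sketch of fitting a loss-to-loss relation, assuming a shifted power-law form loss_B ≈ K · (loss_A − E_A)^κ + E_B between losses of matched checkpoints on two datasets. The functional form, variable names, and synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(loss_a, K, kappa, E_a, E_b):
    """Assumed loss-to-loss form: loss_b = K * (loss_a - E_a)^kappa + E_b."""
    return K * np.clip(loss_a - E_a, 1e-9, None) ** kappa + E_b

# Synthetic paired losses (stand-ins for losses of matched checkpoints
# evaluated on dataset A and dataset B).
rng = np.random.default_rng(0)
loss_a = np.linspace(2.2, 4.0, 30)
loss_b = 1.3 * (loss_a - 1.8) ** 0.9 + 2.0 + rng.normal(scale=0.02, size=30)

params, _ = curve_fit(shifted_power_law, loss_a, loss_b,
                      p0=(1.0, 1.0, 1.5, 1.5), maxfev=20000)
K, kappa, E_a, E_b = params
print(f"K={K:.2f}, kappa={kappa:.2f}, E_a={E_a:.2f}, E_b={E_b:.2f}")
```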
arXiv Detail & Related papers (2025-02-17T18:45:25Z)
- Scaling Law for Stochastic Gradient Descent in Quadratically Parameterized Linear Regression [5.801904710149222]
In machine learning, scaling laws describe how model performance improves as model and data size scale up. This paper studies the scaling law for linear regression with a quadratically parameterized model. As a result, for canonical linear regression, we provide explicit separations between the generalization curves obtained with and without feature learning, and the information-theoretic lower bound that is agnostic to the parametrization method and the algorithm; a sketch of the quadratic parameterization is given below.
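A minimal sketch of what "quadratically parameterized" linear regression can mean: the regression coefficients are represented as an elementwise square, β = w ⊙ w (one common convention; the paper's exact parameterization may differ), and w is trained with SGD on squared loss. All hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 100
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:5] = 1.0                       # sparse ground-truth coefficients
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Quadratic parameterization: beta = w * w (elementwise). The effective model
# is still linear in x, but SGD on w induces a feature-learning-like bias.
w = np.full(d, 0.1)
lr = 0.01
for step in range(20_000):
    i = rng.integers(n)
    beta = w * w
    residual = X[i] @ beta - y[i]
    grad_w = residual * X[i] * 2 * w      # chain rule through beta = w**2
    w -= lr * grad_w

beta_hat = w * w
print("coefficient error:", float(np.mean((beta_hat - beta_true) ** 2)))
```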
arXiv Detail & Related papers (2025-02-13T09:29:04Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies that deprioritize redundant or uninformative data; a generic loss-based weighting sketch is given below.
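Not the paper's algorithm; a minimal illustration of dynamic, instance-level reweighting: per-sample losses within a batch are turned into normalized weights (here via a softmax with temperature), so near-zero-loss, likely redundant samples contribute less to the training loss. The temperature and the choice to up-weight high-loss samples are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_batch_loss(logits, targets, temperature=1.0):
    """Instance-level reweighting sketch: per-sample cross-entropy losses are
    converted to softmax weights over the batch, and the weighted sum is used
    as the training loss."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)
    return (weights * per_sample).sum()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Linear(16, 4)
    x = torch.randn(32, 16)
    y = torch.randint(0, 4, (32,))
    loss = weighted_batch_loss(model(x), y)
    loss.backward()
    print(float(loss))
```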
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on the pre-training loss as a more computation-efficient metric for performance estimation. We present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training; a generic two-stage fit is sketched below.
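A minimal sketch of the generic two-stage idea behind predicting downstream performance via pre-training loss: fit compute → loss, then loss → accuracy, and compose the two to extrapolate. The functional forms (power law, then sigmoid), the synthetic numbers, and the names are assumptions, not the paper's FLP-M formulation.

```python
import numpy as np
from scipy.optimize import curve_fit

def compute_to_loss(flops, A, alpha, E):
    """Stage 1 (assumed form): pre-training loss as a power law in compute."""
    return A * (flops / 1e18) ** (-alpha) + E

def loss_to_accuracy(loss, a, b, c):
    """Stage 2 (assumed form): downstream accuracy as a sigmoid in loss."""
    return c / (1.0 + np.exp(a * (loss - b)))

# Synthetic small-scale runs used to fit both stages.
flops = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = np.array([3.40, 3.15, 2.95, 2.80, 2.68])
acc = np.array([0.28, 0.35, 0.43, 0.50, 0.56])

p1, _ = curve_fit(compute_to_loss, flops, loss, p0=(2.0, 0.1, 1.5), maxfev=20000)
p2, _ = curve_fit(loss_to_accuracy, loss, acc, p0=(3.0, 3.0, 0.8), maxfev=20000)

# Extrapolate to a larger compute budget.
target_flops = 1e21
pred_loss = compute_to_loss(target_flops, *p1)
pred_acc = loss_to_accuracy(pred_loss, *p2)
print(f"predicted loss {pred_loss:.2f}, predicted accuracy {pred_acc:.2f}")
```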
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- Temporal Scaling Law for Large Language Models [70.74571133406958]
We propose the novel concept of a Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down into the fine-grained test loss at each token position. We derive a much more precise temporal scaling law by studying the temporal patterns of the parameters of the dynamic hyperbolic law; a per-position fitting sketch is given below.
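A minimal sketch of fitting a hyperbolic (reciprocal-style) law to per-token-position losses at a single checkpoint. The specific form loss(k) ≈ 1/(a·k + b) + c, the synthetic data, and the aggregation step are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic_law(position, a, b, c):
    """Assumed per-position form: loss decreases roughly as 1/(a*k + b) + c."""
    return 1.0 / (a * position + b) + c

# Synthetic per-token-position losses for one checkpoint (positions 1..512).
rng = np.random.default_rng(0)
positions = np.arange(1, 513, dtype=float)
losses = 1.0 / (0.02 * positions + 0.5) + 2.3 + rng.normal(scale=0.01, size=512)

(a, b, c), _ = curve_fit(hyperbolic_law, positions, losses, p0=(0.01, 1.0, 2.0))
print(f"a={a:.4f}, b={b:.3f}, c={c:.3f}")

# Fits at different training steps could then be aggregated, e.g. by averaging
# the fitted curve over positions, to track the overall test loss over training.
overall = hyperbolic_law(positions, a, b, c).mean()
print(f"aggregated loss estimate: {overall:.3f}")
```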
arXiv Detail & Related papers (2024-04-27T05:49:11Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play a significant role.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.