Related papers: Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training

URL: http://arxiv.org/abs/2507.09846v1
Date: Mon, 14 Jul 2025 00:54:48 GMT
Title: Through the River: Understanding the Benefit of Schedule-Free Methods for Language Model Training
Authors: Minhak Song, Beomhan Baek, Kwangjun Ahn, Chulhee Yun,
Abstract summary: We show that Schedule-Free (SF) effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging.<n>We propose a refined variant of SF that improves to momentum and performs better under large batch sizes.
Score: 16.736880202930482
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As both model and dataset sizes continue to scale rapidly, conventional pretraining strategies with fixed compute budgets-such as cosine learning rate schedules-are increasingly inadequate for large-scale training. Recent alternatives, including warmup-stable-decay (WSD) schedules and weight averaging, offer greater flexibility. However, WSD relies on explicit decay phases to track progress, while weight averaging addresses this limitation at the cost of additional memory. In search of a more principled and scalable alternative, we revisit the Schedule-Free (SF) method [Defazio et al., 2024], which has shown strong empirical performance across diverse settings. We show that SF-AdamW effectively navigates the "river" structure of the loss landscape without decay phases or auxiliary averaging, making it particularly suitable for continuously scaling training workloads. To understand this behavior, we conduct a theoretical and empirical analysis of SF dynamics, revealing that it implicitly performs weight averaging without memory overhead. Guided by this analysis, we propose a refined variant of SF that improves robustness to momentum and performs better under large batch sizes, addressing key limitations of the original method. Together, these results establish SF as a practical, scalable, and theoretically grounded approach for language model training.

Related papers

How to Set the Batch Size for Large-Scale Pre-training? [46.58311647781476]
This paper proposes a revised E(S) relationship tailored for the Warmup-Stable-Decay (WSD) learning rate scheduler.<n>Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens.
arXiv Detail & Related papers (2026-01-08T15:43:31Z)
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE.<n>This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
Stable Coresets via Posterior Sampling: Aligning Induced and Full Loss Landscapes [7.446140380340418]
Coreset selection aims to accelerate training by identifying small, representative subsets of data that approximate the performance of the full dataset.<n> gradient based methods stand out due to their strong theoretical underpinnings and practical benefits, particularly under limited data budgets.<n>We propose a novel framework that addresses these limitations. First, we establish a connection between posterior sampling and loss landscapes, enabling robust coreset selection even in high data corruption scenarios.
arXiv Detail & Related papers (2025-11-21T17:00:00Z)
CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models.<n>We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules [9.332823269318842]
Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models.<n>We establish a Functional Scaling Law that captures the full loss trajectory under arbitrary LRSs.<n>We derive explicit scaling relations in both data- and compute-limited regimes.
arXiv Detail & Related papers (2025-09-23T16:05:16Z)
Optimal Growth Schedules for Batch Size and Learning Rate in SGD that Reduce SFO Complexity [0.6906005491572401]
Batch-size and learning-rate scheduling in computational gradient methods can degrade efficiency and compromise convergence.<n>We theoretically derived optimal growth schedules for the batch size and learning rate that reduce SFO complexity.<n>Our results offer both theoretical insights and practical guidelines for scalable and efficient large-batch training in deep learning.
arXiv Detail & Related papers (2025-08-07T11:52:25Z)
Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models [0.41942958779358663]
We propose a predictive framework that models training dynamics and helps optimize resource usage.<n>We derive an empirical scaling law based on model size, initial performance, and training progress.<n>We find that training beyond certain number of an epoch offers little gain, suggesting earlier stopping can significantly reduce compute without sacrificing performance.
arXiv Detail & Related papers (2025-07-24T01:09:25Z)
Fast and Stable Diffusion Planning through Variational Adaptive Weighting [3.745003761050674]
Diffusion models have recently shown promise in offline RL.<n>These methods often suffer from high training costs and slow convergence.<n>We introduce a closed-form approximation method for its online estimation under the flow-based generative modeling framework.<n> Experimental results on Maze2D and Kitchen tasks show that our method achieves competitive performance with up to 10 times fewer training steps.
arXiv Detail & Related papers (2025-06-20T02:12:04Z)
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining [12.630306478872043]
We propose textbfAdaLRS, a plug-in-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search.<n>Experiments show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-16T09:14:01Z)
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridgingSupervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training.<n>We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling [11.168336416219857]
Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates.<n>We show that this discrepancy is not fully explained by finite-width phenomena such as catapult effects.<n>We validate that neural networks operate in this controlled divergence regime under CE loss but not under MSE loss.
arXiv Detail & Related papers (2025-05-28T15:40:48Z)
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.<n>We introduce novel algorithms for dynamic, instance-level data reweighting.<n>Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.<n>We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.<n>We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR. For the first time, we revisit the CLIP and CoOp with our method to effectively improve the model on few-shot image classficiation scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate scale variation challenge in object detection. Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling. It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.