Mpemba Effect in Large-Language Model Training Dynamics: A Minimal Analysis of the Valley-River model
- URL: http://arxiv.org/abs/2507.04206v1
- Date: Sun, 06 Jul 2025 01:34:12 GMT
- Title: Mpemba Effect in Large-Language Model Training Dynamics: A Minimal Analysis of the Valley-River model
- Authors: Sibei Liu, Zhijian Hu
- Abstract summary: Learning rate schedules in large language model (LLM) training often follow empirical templates: warm-up, constant plateau/stable phase, and decay. We connect training dynamics to a thermodynamic analogy via the Mpemba effect. We show that for certain loss landscapes, there exists an optimal plateau learning rate - the "strong Mpemba point".
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning rate (LR) schedules in large language model (LLM) training often follow empirical templates: warm-up, constant plateau/stable phase, and decay (WSD). However, the mechanistic explanation for this strategy remains underexplored, and the choice of plateau height and decay schedule is largely heuristic. In this paper, we connect training dynamics to a thermodynamic analogy via the Mpemba effect - a phenomenon in which a hotter system cools faster than a colder one when quenched into the same bath. We analyze a class of "valley-river" loss landscapes, where sharp (valley) directions equilibrate quickly, while flatter (river) directions govern global descent. The Mpemba effect provides an explanation for the necessity of the warm-up phase and motivates a high plateau - rather than a low one - for accelerating loss decrease during decay. We show that for certain loss landscapes, there exists an optimal plateau learning rate - the "strong Mpemba point" - at which the slowest mode vanishes, resulting in faster convergence during the decay phase. We derive analytical conditions for its existence and estimate decay dynamics required to preserve the Mpemba advantage. Our minimal model and analysis offer a principled justification for plateau-based schedulers and provide guidance for tuning LR in LLMs with minimal hyperparameter sweep.
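The abstract sketches a mechanism rather than an algorithm, so a toy illustration may help make the valley-river picture concrete. The snippet below (an assumed minimal setup, not the paper's model or code) runs noisy gradient descent on a two-dimensional anisotropic quadratic under a warm-up/stable/decay schedule: the large-curvature "valley" coordinate equilibrates almost immediately, while the small-curvature "river" coordinate dominates the final loss, so the plateau learning rate controls how far the slow mode has relaxed before the decay phase begins. The curvatures, noise level, and schedule parameters are arbitrary choices for illustration.

```python
# Minimal illustrative sketch (assumed toy setup, not the paper's model):
# noisy gradient descent on a 2-D anisotropic quadratic driven by a
# warm-up / stable / decay (WSD) learning rate schedule.
import numpy as np

rng = np.random.default_rng(0)

A = np.array([100.0, 0.1])  # curvatures: sharp "valley" vs. flat "river" direction


def loss(theta):
    return 0.5 * np.sum(A * theta ** 2)


def wsd_lr(step, total, warmup, stable, lr_peak, lr_final):
    """Piecewise WSD schedule: linear warm-up, constant plateau, linear decay."""
    if step < warmup:
        return lr_peak * (step + 1) / warmup
    if step < warmup + stable:
        return lr_peak
    frac = (step - warmup - stable) / max(total - warmup - stable, 1)
    return lr_peak + frac * (lr_final - lr_peak)


def train(lr_peak, total=3000, warmup=100, stable=2000, lr_final=1e-4, noise=0.3):
    """Run noisy (SGD-like) gradient descent and return the final loss."""
    theta = np.array([1.0, 1.0])
    for step in range(total):
        lr = wsd_lr(step, total, warmup, stable, lr_peak, lr_final)
        grad = A * theta + noise * rng.standard_normal(2)  # gradient plus minibatch-style noise
        theta = theta - lr * grad
    return loss(theta)


# Compare a low and a high plateau learning rate under the same decay budget.
for lr_peak in (2e-3, 1e-2):
    print(f"plateau lr = {lr_peak:.0e} -> final loss = {train(lr_peak):.4e}")
```

Running the comparison at the bottom gives a crude sense of why a higher ("hotter") plateau can end the decay phase at a lower loss, in the spirit of the Mpemba analogy; the paper's strong Mpemba point and its existence conditions are derived analytically for the valley-river model, not from a simulation like this.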
Related papers
- WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
arXiv Detail & Related papers (2025-07-23T16:02:06Z)
- LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization
Low-Rank Adaptation (LoRA) methods exhibit "double descent" in training loss. LoRA-MGPO is a novel LoRA-based framework incorporating Momentum-Guided Perturbation Optimization (MGPO). Experiments on natural language understanding and generation benchmarks demonstrate that LoRA-MGPO outperforms LoRA and state-of-the-art PEFT methods.
arXiv Detail & Related papers (2025-02-20T13:14:41Z)
- TAUDiff: Highly efficient kilometer-scale downscaling using generative diffusion models
It is crucial to achieve rapid turnaround, dynamical consistency, and accurate spatio-temporal recovery for extreme weather events. We propose an efficient diffusion model, TAUDiff, that combines a deterministic spatio-temporal model for mean-field downscaling with a smaller generative diffusion model for recovering the fine-scale features. Our approach can ensure quicker simulation of extreme events necessary for estimating associated risks and economic losses.
arXiv Detail & Related papers (2024-12-18T09:05:19Z)
- ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion Estimation
Event cameras hold significant promise for high-temporal-resolution (HTR) motion estimation. We propose a residual-based paradigm for estimating HTR optical flow with event data.
arXiv Detail & Related papers (2024-12-12T09:35:47Z)
- Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective
The Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can continue indefinitely without a pre-specified compute budget. We show that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints' decay phases and keeps only one main branch.
arXiv Detail & Related papers (2024-10-07T16:49:39Z)
- Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality
We study the dynamics of gradient flow for training a multi-head softmax attention model for in-context learning of multi-task linear regression.
We prove that an interesting "task allocation" phenomenon emerges during the gradient flow dynamics.
arXiv Detail & Related papers (2024-02-29T18:43:52Z)
- On the Dynamics Under the Unhinged Loss and Beyond
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- A Neural PDE Solver with Temporal Stencil Modeling
Recent Machine Learning (ML) models have shown new promise in capturing important dynamics in high-resolution signals.
This study shows that significant information is often lost in the low-resolution down-sampled features.
We propose a new approach, which combines the strengths of advanced time-series sequence modeling and state-of-the-art neural PDE solvers.
arXiv Detail & Related papers (2023-02-16T06:13:01Z)
- On regularization of gradient descent, layer imbalance and flat minima
We analyze the training dynamics for deep linear networks using a new metric - imbalance - which defines the flatness of a solution.
We demonstrate that different regularization methods, such as weight decay or noise data augmentation, behave in a similar way.
arXiv Detail & Related papers (2020-07-18T00:09:14Z)
- A Near-Optimal Gradient Flow for Learning Neural Energy-Based Models
We propose a novel numerical scheme to optimize the gradient flows for learning energy-based models (EBMs).
We derive a second-order Wasserstein gradient flow of the global relative entropy from the Fokker-Planck equation.
Compared with existing schemes, the Wasserstein gradient flow is a smoother and near-optimal numerical scheme to approximate real data densities.
arXiv Detail & Related papers (2019-10-31T02:26:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.