Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
- URL: http://arxiv.org/abs/2506.12543v1
- Date: Sat, 14 Jun 2025 15:37:31 GMT
- Title: Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
- Authors: Teodora Srećković, Jonas Geiping, Antonio Orvieto
- Abstract summary: Adam is known to perform significantly better than Stochastic Gradient Descent (SGD) in language models. We exhaustively study how momentum, gradient clipping, and batch size affect the gap between SGD and Adam.
- Score: 36.106114687828395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adam is known to perform significantly better than Stochastic Gradient Descent (SGD) in language models, a phenomenon for which a number of explanations have been proposed. In this work, we revisit this "optimizer gap" through a series of comprehensively tuned baseline training runs for language modeling with Transformers. We exhaustively study how momentum, gradient clipping, and batch size affect the gap between SGD and Adam. Our empirical findings show that SGD with momentum can actually perform similarly to Adam in small-batch settings, if tuned correctly. We revisit existing explanations for Adam's advantage, including heavy-tailed class imbalance, directional sharpness, and Hessian heterogeneity, which struggle to directly explain this phenomenon. Towards bridging this gap in our understanding, by analyzing our Transformer training runs and simple quadratic settings inspired by the literature, we provide new insights, driven by stochastic differential equation models, into the role of batch size on the training dynamics.
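As context for the optimizer gap being studied, the two update rules can be sketched as toy scalar implementations on a quadratic objective (generic hyperparameters; this is an illustration, not the paper's training setup):

```python
import numpy as np

def sgd_momentum_step(w, grad, buf, lr=0.1, beta=0.9):
    """One SGD-with-momentum (heavy-ball) step: accumulate, then step."""
    buf = beta * buf + grad
    return w - lr * buf, buf

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with bias-corrected first and second moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy objective f(w) = w^2, gradient 2w: both optimizers drive w toward 0.
w_sgd, buf = 5.0, 0.0
w_adam, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w_sgd, buf = sgd_momentum_step(w_sgd, 2.0 * w_sgd, buf)
    w_adam, m, v = adam_step(w_adam, 2.0 * w_adam, m, v, t)
print(f"SGD+momentum: {w_sgd:.6f}  Adam: {w_adam:.6f}")
```

On a well-conditioned toy problem like this both methods converge; the paper's point is about how the comparison shifts with batch size and tuning in Transformer training.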
Related papers
- In Search of Adam's Secret Sauce [11.215133680044005]
We train over 1,300 language models across different data configurations and scales. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam. We show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients.
arXiv Detail & Related papers (2025-05-27T23:30:18Z)
- AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training [22.58304858379219]
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance.
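The described denominator can be sketched as follows; this is a schematic reading of the abstract, and the particular (beta, 1-beta) weighting of the two squared terms is an assumption, not the paper's verbatim formula:

```python
import numpy as np

def adams_step(w, grad, buf, lr=0.01, beta=0.9, eps=1e-8):
    """AdamS-style step (sketch): normalize by the root of a weighted sum of
    squares of the momentum buffer and the current gradient, so no separate
    second-moment buffer is kept (same memory as SGD with momentum)."""
    buf = beta * buf + (1 - beta) * grad
    denom = np.sqrt(beta * buf ** 2 + (1 - beta) * grad ** 2) + eps
    return w - lr * buf / denom, buf

# Toy quadratic f(w) = w^2: the update is scale-normalized like Adam's.
w, buf = 5.0, 0.0
for _ in range(1000):
    w, buf = adams_step(w, 2.0 * w, buf)
print(f"final w: {w:.4f}")
```

Note the state is a single momentum buffer, which is where the claimed SGD-with-momentum memory footprint comes from.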
arXiv Detail & Related papers (2025-05-22T08:16:48Z)
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z)
- Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps [65.64965527170156]
We adapt the widely used Adam optimiser for use in reinforcement learning. We show that Adam-Rel uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes. We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
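The timestep-reset idea described above can be sketched with standard Adam and a local counter; the reset trigger here is a stand-in, and the loop is illustrative rather than the authors' RL setup:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam step; t drives the bias-correction terms."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Adam-Rel idea (sketch): feed Adam a *local* timestep that restarts at each
# target change, so the bias correction re-adapts to the new objective.
w, m, v = 5.0, 0.0, 0.0
t_local = 0
for step in range(1, 201):
    if step % 50 == 1:       # stand-in for "target network updated"
        t_local = 0
    t_local += 1
    w, m, v = adam_step(w, 2.0 * w, m, v, t_local)
print(f"final w: {w:.4f}, local timestep: {t_local}")
```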
arXiv Detail & Related papers (2024-12-22T18:01:08Z)
- Latent State Models of Training Dynamics [51.88132043461152]
We train models with different random seeds and compute a variety of metrics throughout training.
We then fit a hidden Markov model (HMM) over the resulting sequences of metrics.
We use the HMM representation to study phase transitions and identify latent "detour" states that slow down convergence.
arXiv Detail & Related papers (2023-08-18T13:20:08Z)
- Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be [16.170888329408353]
We show that the behavior of Adam with large batches is similar to sign descent with momentum.
We present evidence that stochasticity and heavy-tailed noise are not major factors in the performance gap between SGD and Adam.
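Sign descent with momentum, which this line of work uses as a proxy for large-batch Adam, can be sketched as follows (generic hyperparameters; an illustration, not the paper's code):

```python
import numpy as np

def sign_momentum_step(w, grad, buf, lr=0.01, beta=0.9):
    """Sign descent with momentum: step by the sign of the momentum buffer,
    discarding gradient magnitude -- the regime large-batch Adam approaches."""
    buf = beta * buf + (1 - beta) * grad
    return w - lr * np.sign(buf), buf

# Toy quadratic f(w) = w^2: fixed-size steps toward 0, then a small
# oscillation of size ~lr around the minimum.
w, buf = 5.0, 0.0
for _ in range(1000):
    w, buf = sign_momentum_step(w, 2.0 * w, buf)
print(f"final w: {w:.4f}")
```

Because every coordinate moves by exactly lr, the method is insensitive to gradient scale, which is one intuition for why it tracks Adam when gradient noise is averaged out by large batches.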
arXiv Detail & Related papers (2023-04-27T05:41:13Z)
- Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers [93.9369467909176]
We explain language models as meta-optimizers and understand in-context learning as implicit finetuning.
We show that in-context learning behaves similarly to explicit finetuning from multiple perspectives.
The improved performance over vanilla attention further supports our understanding from another perspective.
arXiv Detail & Related papers (2022-12-20T18:58:48Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
- A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration [21.929334023875874]
Integrating adaptive learning rates and momentum acceleration into SGD leads to a large family of efficient adaptive algorithms, such as AdaGrad, RMSProp, and Adam.
arXiv Detail & Related papers (2018-08-10T04:18:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.