Pre-Training LLMs on a budget: A comparison of three optimizers
- URL: http://arxiv.org/abs/2507.08472v2
- Date: Tue, 22 Jul 2025 08:48:53 GMT
- Title: Pre-Training LLMs on a budget: A comparison of three optimizers
- Authors: Joel Schlotthauer, Christian Kroos, Chris Hinze, Viktor Hangya, Luzian Hahn, Fabian Küch
- Abstract summary: We compare three major variants: the de-facto standard AdamW, the simpler Lion, and the second-order Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach.
- Score: 2.8090964770805207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimizers play a decisive role in reducing pre-training times for LLMs and achieving better-performing models. In this study, we compare three major variants: the de-facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and a multiple-epoch approach while keeping the number of tokens constant. Using the Maximal Update Parametrization and smaller proxy models, we tune relevant hyperparameters separately for each combination of base architecture and optimizer. We found that while the results from all three optimizers were in approximately the same range, Sophia exhibited the lowest training and validation loss, Lion was fastest in terms of training GPU hours but AdamW led to the best downstream evaluation results.
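To make the proxy-model tuning step concrete, here is a rough sketch (not the authors' code) of how a learning rate found on a narrow proxy can be transferred to the full-width model under the Adam-family µP rule, where the learning rate of hidden weight matrices shrinks proportionally to 1/width. The full Maximal Update Parametrization also rescales initializations and output multipliers, and the exact rules for Lion and Sophia differ; the widths and values below are made up.

```python
import torch
import torch.nn as nn

def mup_adamw_param_groups(model: nn.Module, base_lr: float,
                           base_width: int, width: int,
                           weight_decay: float = 0.1):
    """Build AdamW parameter groups that transfer a learning rate tuned on a
    narrow proxy model (hidden size `base_width`) to a wider target model
    (hidden size `width`), following the muP rule that Adam-type learning
    rates for hidden weight matrices scale with 1/width."""
    matrix_like, vector_like = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Hidden weight matrices get the width-scaled learning rate;
        # biases, norm gains and embeddings keep the base rate.
        (matrix_like if p.ndim >= 2 and "embed" not in name else vector_like).append(p)
    return [
        {"params": matrix_like, "lr": base_lr * base_width / width,
         "weight_decay": weight_decay},
        {"params": vector_like, "lr": base_lr, "weight_decay": 0.0},
    ]

# Suppose base_lr was found by sweeping a width-256 proxy model; reuse it
# for a width-2048 target by rescaling only the matrix-like learning rates.
base_lr = 3e-4
target = nn.Sequential(nn.Linear(2048, 2048), nn.GELU(), nn.Linear(2048, 2048))
opt = torch.optim.AdamW(
    mup_adamw_param_groups(target, base_lr, base_width=256, width=2048),
    betas=(0.9, 0.95),
)
```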
Related papers
- The Impact of Fine-tuning Large Language Models on Automated Program Repair [5.868532677577195]
Automated Program Repair (APR) uses various tools and techniques to help developers achieve functional and error-free code faster. Large Language Models (LLMs) have gained popularity as components in APR tool chains because of their performance and flexibility. Fine-tuning techniques have been developed to adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational costs than training from scratch.
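As a rough illustration of the kind of low-cost adaptation discussed here (not the paper's setup), the snippet below sketches parameter-efficient fine-tuning of a causal code LLM on buggy/fixed code pairs with LoRA adapters. The checkpoint name, target modules, and data are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "your-base-code-llm"  # placeholder checkpoint, not from the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Train only small low-rank adapters instead of all weights, which is what
# keeps the computational cost far below pre-training from scratch.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# One APR-style training example: the model learns to map a buggy function
# (plus a short instruction) to its repaired version.
prompt = "### Buggy code\n<buggy function here>\n### Fixed code\n"
target = "<repaired function here>"
batch = tok(prompt + target, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()  # plug into any optimizer or Trainer loop from here
```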
arXiv Detail & Related papers (2025-07-26T10:42:08Z) - It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs [15.263422862969803]
We introduce BackSlash, a training-time compression algorithm for large language models, and propose a unified, end-to-end framework for LLM optimization based on the generalized Gaussian (GG) model. Our contributions are threefold, including DeepShape, a post-training regularization method that reshapes weight distributions to match a GG profile, and RF8, a compact and hardware-efficient 8-bit floating-point format designed for BackSlash training with GG-distributed priors.
arXiv Detail & Related papers (2025-05-31T09:49:17Z) - C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing [21.119495676190127]
Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways: the naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. We develop a novel class of test-time optimization methods to re-weight or "re-mix" the experts in different layers jointly for each test sample.
arXiv Detail & Related papers (2025-04-10T17:59:56Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks. We introduce three instances of MARS that leverage preconditioned gradient optimization. Results indicate that the implementation of MARS consistently outperforms Adam.
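The exact MARS recursions are in the paper; purely to illustrate the general pattern of feeding a variance-reduced gradient into an Adam-style preconditioned update, here is a simplified sketch that uses the previous step's gradient as a crude control variate. It is not the MARS algorithm itself, and the constants are illustrative.

```python
import torch

@torch.no_grad()
def vr_adam_step(param, grad, state, lr=3e-4, gamma=0.025,
                 betas=(0.9, 0.999), eps=1e-8, step=1):
    """One simplified variance-reduced, Adam-style update.

    The raw gradient is first corrected with a scaled difference to the
    previous gradient (a crude control variate); the corrected gradient is
    then fed into standard Adam first/second moment estimates."""
    prev_g = state.setdefault("prev_grad", torch.zeros_like(grad))
    m = state.setdefault("m", torch.zeros_like(grad))
    v = state.setdefault("v", torch.zeros_like(grad))

    c = grad + gamma * (grad - prev_g)           # variance-reduced gradient
    m.mul_(betas[0]).add_(c, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(c, c, value=1 - betas[1])

    m_hat = m / (1 - betas[0] ** step)           # bias correction
    v_hat = v / (1 - betas[1] ** step)
    param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
    state["prev_grad"] = grad.clone()
```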
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
We show that gradient-based optimization and high-level, LLM-driven optimization are complementary to each other and can effectively collaborate in a combined optimization framework.
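One way to picture such a collaboration, as a sketch under assumptions rather than the paper's method: alternate local gradient refinement of a soft prompt with global candidates proposed by an LLM, keeping whichever scores best on held-out data. `grad_fn`, `val_score`, and `propose_with_llm` are hypothetical stand-ins (the last one would, for example, generate a textual prompt and re-embed it).

```python
import torch

def tune_prompt(soft_prompt: torch.Tensor, grad_fn, val_score, propose_with_llm,
                rounds: int = 10, inner_steps: int = 50, lr: float = 1e-2):
    """Alternate local gradient refinement of a soft prompt (a leaf tensor of
    prompt embeddings) with global, LLM-proposed rewrites, and keep whichever
    candidate scores best on held-out data.

    grad_fn(prompt) -> scalar loss, val_score(prompt) -> scalar score,
    propose_with_llm(prompt) -> a new candidate prompt tensor (same shape)."""
    best, best_score = soft_prompt.clone(), val_score(soft_prompt)
    opt = torch.optim.AdamW([soft_prompt.requires_grad_()], lr=lr)
    for _ in range(rounds):
        for _ in range(inner_steps):             # local search: gradient descent
            opt.zero_grad()
            grad_fn(soft_prompt).backward()
            opt.step()
        candidates = [soft_prompt.detach(), propose_with_llm(best)]  # global search
        for cand in candidates:
            score = val_score(cand)
            if score > best_score:
                best, best_score = cand.clone(), score
        soft_prompt.data.copy_(best)             # restart local search from the best
    return best
```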
arXiv Detail & Related papers (2024-05-30T06:24:14Z) - Incorporating Test-Time Optimization into Training with Dual Networks for Human Mesh Recovery [35.138312681232264]
We propose a dual-network architecture that unifies the training-time and test-time objectives.
Our method, armed with meta-learning and the dual networks, outperforms state-of-the-art regression-based and optimization-based HMR approaches.
arXiv Detail & Related papers (2024-01-25T12:04:53Z) - Symbolic Discovery of Optimization Algorithms [132.62397077095787]
We use efficient search techniques to explore an infinite and sparse program space.
Our method discovers a simple and effective optimization algorithm, Lion.
Lion is successfully deployed in production systems such as Google's search ads CTR model.
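The discovered update rule is compact enough to sketch: keep one momentum buffer, take the sign of an interpolation between it and the current gradient, and apply decoupled weight decay. A minimal PyTorch rendering of that published rule (defaults are illustrative, not tuned):

```python
import torch
from torch.optim import Optimizer

class Lion(Optimizer):
    """Minimal Lion: signed interpolated momentum plus decoupled weight decay."""

    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        super().__init__(params, dict(lr=lr, betas=betas, weight_decay=weight_decay))

    @torch.no_grad()
    def step(self, closure=None):  # closure not supported in this sketch
        for group in self.param_groups:
            lr, (b1, b2), wd = group["lr"], group["betas"], group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                m = self.state[p].setdefault("m", torch.zeros_like(p))
                p.mul_(1 - lr * wd)                        # decoupled weight decay
                update = m.mul(b1).add_(p.grad, alpha=1 - b1).sign_()
                p.add_(update, alpha=-lr)                  # signed update
                m.mul_(b2).add_(p.grad, alpha=1 - b2)      # momentum update
        return None
```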
arXiv Detail & Related papers (2023-02-13T20:27:30Z) - Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice these learned optimizers do not work well even in simple RL tasks.
The agent-gradient distribution is non-i.i.d., leading to inefficient meta-training.
We show that, although only trained on toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z) - VeLO: Training Versatile Learned Optimizers by Scaling Up [67.90237498659397]
We leverage the same scaling approach behind the success of deep learning to learn versatile optimizers.
We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates.
We open source our learned optimizers, the meta-training code, the associated train and test data, and an extensive benchmark suite with baselines at velo-code.io.
arXiv Detail & Related papers (2022-11-17T18:39:07Z)
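To make the learned-optimizer entries above (the RL learned-optimizer and VeLO papers) concrete, here is a toy sketch of the object they study: a small network that ingests per-parameter gradient features and outputs parameter updates, with its own weights meta-trained across many tasks. This is a conceptual illustration only, far simpler than VeLO, and the outer (meta-)training loop is omitted.

```python
import torch
import torch.nn as nn

class TinyLearnedOptimizer(nn.Module):
    """A per-parameter MLP that maps simple gradient features to an update.

    Real learned optimizers use richer features and recurrent state and are
    meta-trained over thousands of tasks; this toy version only shows the
    input/output contract."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, param: torch.Tensor, grad: torch.Tensor,
                momentum: torch.Tensor) -> torch.Tensor:
        # Features per scalar parameter: gradient, momentum, current value.
        feats = torch.stack([grad, momentum, param], dim=-1).flatten(0, -2)
        return self.net(feats).view_as(param)     # proposed update, same shape

# Inner-loop usage on one task; the outer loop that trains `lopt.parameters()`
# by back-propagating through these unrolled steps is omitted.
lopt = TinyLearnedOptimizer()
w = torch.randn(8, 8, requires_grad=True)
m = torch.zeros_like(w)
for _ in range(3):
    loss = (w ** 2).sum()                          # stand-in task loss
    grad, = torch.autograd.grad(loss, w, create_graph=True)
    m = 0.9 * m + 0.1 * grad
    w = w + 0.01 * lopt(w, grad, m)                # apply the learned update
```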