Benchmarking Optimizers for Large Language Model Pretraining
- URL: http://arxiv.org/abs/2509.01440v1
- Date: Mon, 01 Sep 2025 12:50:30 GMT
- Title: Benchmarking Optimizers for Large Language Model Pretraining
- Authors: Andrei Semenov, Matteo Pagliardini, Martin Jaggi
- Abstract summary: This study presents a comprehensive evaluation of recent optimization techniques across standardized pretraining scenarios. We provide guidance to practitioners on which optimizer is best suited for each scenario.
- Score: 46.1830130330317
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those methods are myriad: from faster convergence to removing reliance on certain hyperparameters. However, the diverse experimental protocols used to validate these claims make direct comparisons between methods challenging. This study presents a comprehensive evaluation of recent optimization techniques across standardized LLM pretraining scenarios, systematically varying model size, batch size, and training duration. Through careful tuning of each method, we provide guidance to practitioners on which optimizer is best suited for each scenario. For researchers, our work highlights promising directions for future optimization research. Finally, by releasing our code and making all experiments fully reproducible, we hope our efforts can help the development and rigorous benchmarking of future methods.
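The protocol described in the abstract — sweeping optimizers across model size, batch size, and training duration, with careful per-scenario tuning — can be sketched as below. This is a hypothetical illustration: the optimizer names, grids, and the `run_pretraining` stand-in are assumptions for exposition, not the paper's actual code or results.

```python
# Illustrative sketch of a standardized optimizer benchmark: for every
# (optimizer, model size, batch size, duration) cell, tune the learning
# rate and keep the best final loss.
from itertools import product

OPTIMIZERS = ["adamw", "lion", "soap"]   # candidate methods (illustrative)
MODEL_SIZES = [124e6, 350e6]             # parameters
BATCH_SIZES = [256, 1024]                # sequences per step
TRAIN_STEPS = [10_000, 50_000]           # training duration
LR_GRID = [1e-4, 3e-4, 1e-3]             # per-scenario learning-rate tuning

def run_pretraining(optimizer, size, batch, steps, lr):
    """Stand-in for a real pretraining run; returns a mock final loss."""
    # A real implementation would launch training and report validation loss.
    return 1.0 / (lr * batch * steps / size)  # placeholder metric

results = {}
for opt, size, batch, steps in product(OPTIMIZERS, MODEL_SIZES,
                                       BATCH_SIZES, TRAIN_STEPS):
    # Tune the learning rate independently per scenario, as the abstract
    # emphasizes careful tuning of each method.
    best = min(run_pretraining(opt, size, batch, steps, lr)
               for lr in LR_GRID)
    results[(opt, size, batch, steps)] = best

print(len(results))  # → 24, one tuned result per (optimizer, scenario) cell
```

The key design point is that tuning happens inside each grid cell, so no optimizer is handicapped by hyperparameters carried over from a different scenario.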
Related papers
- Scaling and Transferability of Annealing Strategies in Large Language Model Training [59.443651879173025]
We refine a predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. We validate our findings on extensive experiments using both Dense and Mixture-of-Experts (MoE) models.
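The Warmup-Steady-Decay schedule referenced in this entry can be sketched as follows. The phase lengths and the linear decay shape are illustrative assumptions, not the paper's tuned settings:

```python
# Minimal WSD (Warmup-Steady-Decay) learning-rate schedule: linear warmup
# to max_lr, a long constant ("steady") phase, then annealing to zero.
def wsd_lr(step, max_lr=3e-4, warmup=100, steady=800, decay=100):
    total = warmup + steady + decay
    if step < warmup:                        # linear warmup
        return max_lr * (step + 1) / warmup
    if step < warmup + steady:               # steady phase at max_lr
        return max_lr
    # linear annealing down to zero over the decay window
    remaining = total - step
    return max_lr * max(remaining, 0) / decay
```

Unlike cosine schedules, the decay phase here is decoupled from total training length, which is what makes annealing strategies under WSD worth optimizing separately.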
arXiv Detail & Related papers (2025-12-05T16:38:33Z)
- FoMEMO: Towards Foundation Models for Expensive Multi-objective Optimization [19.69959362934787]
We propose a new paradigm named FoMEMO, which enables the establishment of a foundation model conditioned on any domain trajectory and user preference. Rather than accessing extensive domain experiments in the real world, we demonstrate that pre-training the foundation model with a diverse set of hundreds of millions of synthetic data can lead to superior adaptability to unknown problems.
arXiv Detail & Related papers (2025-09-03T12:00:24Z)
- Divergence Minimization Preference Optimization for Diffusion Model Alignment [66.31417479052774]
Divergence Minimization Preference Optimization (DMPO) is a principled method for aligning diffusion models by minimizing reverse KL divergence. DMPO can consistently outperform or match existing techniques across different base models and test sets.
arXiv Detail & Related papers (2025-07-10T07:57:30Z)
- Optimization-Inspired Few-Shot Adaptation for Large Language Models [25.439708260502556]
Large Language Models (LLMs) have demonstrated remarkable performance in real-world applications. Adapting LLMs to novel tasks via fine-tuning often requires substantial training data and computational resources that are impractical in few-shot scenarios. Existing approaches, such as in-context learning and Parameter-Efficient Fine-Tuning (PEFT), face key limitations.
arXiv Detail & Related papers (2025-05-25T11:54:23Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment [40.71270945505082]
Large language models (LLMs) are increasingly integrated into various societal and decision-making processes. Traditional methods, such as reinforcement learning from human feedback (RLHF), achieve alignment by fine-tuning model parameters. In contrast, prompt optimization is a viable alternative to RLHF for LLM alignment.
arXiv Detail & Related papers (2025-01-07T03:14:39Z)
- Plug-and-Play Training Framework for Preference Optimization [25.53286104242179]
We propose a novel training framework for large language models (LLMs). This framework employs multiple sampling to analyze output distributions, assign different weights to samples, and incorporate these weights into the preference optimization process. Experimental results demonstrate that our framework integrates seamlessly with various preference optimization methods and achieves consistent improvements in mathematical reasoning tasks.
arXiv Detail & Related papers (2024-12-30T15:01:48Z)
- AIPO: Improving Training Objective for Iterative Preference Optimization [34.24211649396053]
We study iterative preference optimization with synthetic data.
We propose a training objective for iterative preference optimization, namely Agreement-aware Iterative Preference Optimization (AIPO).
arXiv Detail & Related papers (2024-09-13T14:03:49Z)
- Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [108.72225067368592]
We propose a novel perspective to investigate the design of large language model (LLM)-based prompts. We identify two pivotal factors in model parameter learning: update direction and update method. We develop a capable Gradient-inspired Prompt Optimizer (GPO).
arXiv Detail & Related papers (2024-02-27T15:05:32Z)
- A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers.
This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models.
Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
arXiv Detail & Related papers (2024-02-10T11:14:53Z)
- PerfRL: A Small Language Model Framework for Efficient Code Optimization [14.18092813639534]
In this paper, we introduce PerfRL, an innovative framework designed to tackle the problem of code optimization. Our framework leverages the capabilities of small language models (SLMs) and reinforcement learning (RL). Our approach achieves similar or better results compared to state-of-the-art models using shorter training times and smaller pre-trained models.
arXiv Detail & Related papers (2023-12-09T19:50:23Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up at inference while retaining comparable performance.
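The non-parametric idea behind this entry — interpolating a base model's next-token distribution with one built from nearest neighbors in an external datastore — can be sketched in miniature as below. The toy datastore, vectors, and interpolation weight are illustrative assumptions, not the paper's setup:

```python
# Toy kNN-LM: the datastore maps context vectors to observed next tokens;
# retrieval builds a kNN distribution that is mixed with the base LM.
import math

datastore = [([1.0, 0.0], "cat"), ([0.9, 0.1], "cat"), ([0.0, 1.0], "dog")]

def knn_distribution(query, k=2):
    # Retrieve the k closest datastore entries by Euclidean distance
    # and turn their next tokens into a probability distribution.
    scored = sorted(datastore, key=lambda e: math.dist(query, e[0]))[:k]
    probs = {}
    for _, token in scored:
        probs[token] = probs.get(token, 0.0) + 1.0 / k
    return probs

def interpolate(lm_probs, knn_probs, lam=0.25):
    # p(w) = (1 - lam) * p_LM(w) + lam * p_kNN(w)
    tokens = set(lm_probs) | set(knn_probs)
    return {t: (1 - lam) * lm_probs.get(t, 0.0)
               + lam * knn_probs.get(t, 0.0) for t in tokens}

lm = {"cat": 0.5, "dog": 0.5}
mixed = interpolate(lm, knn_distribution([1.0, 0.0]))
```

In a real system the datastore holds millions of entries and retrieval uses approximate nearest-neighbor search, which is exactly the cost the paper's speed-ups target.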
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.