Conda: Column-Normalized Adam for Training Large Language Models Faster
- URL: http://arxiv.org/abs/2509.24218v2
- Date: Tue, 30 Sep 2025 02:02:30 GMT
- Title: Conda: Column-Normalized Adam for Training Large Language Models Faster
- Authors: Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
- Abstract summary: Column-Normalized Adam (Conda) is a novel optimizer for training large language models (LLMs). Conda projects updates into an orthogonal subspace and applies column-wise second-moment normalization based on the projected gradients. Experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training.
- Score: 70.66067959375748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. While Adam-based optimizers offer fast convergence by adapting learning rates coordinate-wise, recent studies reveal that their updates often suffer from poor spectral conditioning and low-rank structures, hindering efficiency. Muon addresses this issue via global spectral normalization but lacks the per-coordinate adaptivity of Adam. In this work, we propose Column-Normalized Adam (Conda), a novel optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second-moment normalization based on the projected gradients, thereby improving spectral conditioning while maintaining coordinate-wise adaptivity. This design alleviates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Remarkably, on the LLaMA series, Conda achieves 2-2.5× the convergence speed of AdamW, measured in both training steps and training time. Further ablations demonstrate its robustness under diverse training setups. These results collectively highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is released at https://github.com/jie040109/Conda
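To make the update concrete, here is a minimal NumPy sketch of the column-normalized idea described in the abstract. It is an illustration under stated assumptions, not the released implementation: the projector P is taken from the gradient's left singular vectors (a common choice in subspace optimizers), bias correction is omitted, and details such as how often P is refreshed are left out.

```python
import numpy as np

def conda_step(W, G, m, v, P, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (momentum), exactly as in Adam.
    m = beta1 * m + (1 - beta1) * G
    # Project the gradient into the orthogonal subspace spanned by P's columns.
    Gp = P.T @ G
    # Column-wise second moment of the *projected* gradient: one scalar per column.
    v = beta2 * v + (1 - beta2) * np.mean(Gp ** 2, axis=0)
    # Normalize each column of the momentum by its column statistic.
    update = m / (np.sqrt(v) + eps)
    return W - lr * update, m, v

rng = np.random.default_rng(0)
W, G = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
P, _, _ = np.linalg.svd(G, full_matrices=False)   # orthonormal columns as the subspace
m, v = np.zeros_like(W), np.zeros(4)
W, m, v = conda_step(W, G, m, v, P)
```

The key contrast with Adam is the shape of v: one statistic per column of the projected gradient rather than one per coordinate, which is what gives the update its improved spectral conditioning while keeping adaptivity.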
Related papers
- REG: A Regularization Optimizer for Robust Training Dynamics [24.850151895583494]
The Row-and-Column-Scaling (RACS) operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. We demonstrate that REG not only achieves superior performance and stability over AdamW, but also maintains consistency with the AdamW training paradigm.
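The summary does not spell out the RACS operator, so the following NumPy sketch shows one plausible form of row-and-column scaling; the geometric-mean normalization here is my assumption, not the paper's definition.

```python
import numpy as np

def racs(U, eps=1e-8):
    # Row and column RMS of the raw update matrix.
    row_rms = np.sqrt(np.mean(U ** 2, axis=1, keepdims=True))
    col_rms = np.sqrt(np.mean(U ** 2, axis=0, keepdims=True))
    # Divide each entry by the geometric mean of its row and column RMS:
    # a milder rescaling than full spectral normalization.
    return U / (np.sqrt(row_rms * col_rms) + eps)
```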
arXiv Detail & Related papers (2025-10-04T06:05:57Z)
- AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training [22.58304858379219]
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance.
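The described denominator fits in a few lines. The specific beta2 weighting below is an assumption of this sketch, but it shows why AdamS matches the memory footprint of SGD with momentum: the only persistent buffer is m.

```python
import numpy as np

def adams_step(w, g, m, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum buffer, as in SGD with momentum; no second-moment buffer is kept.
    m = beta1 * m + (1 - beta1) * g
    # Denominator built from the momentum and the current gradient;
    # the exact weighting is an assumption here.
    denom = np.sqrt(beta2 * m ** 2 + (1 - beta2) * g ** 2) + eps
    return w - lr * m / denom, m
```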
arXiv Detail & Related papers (2025-05-22T08:16:48Z)
- Muon is Scalable for LLM Training [50.68746986439438]
We introduce Moonlight, a Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation, which is memory-optimal and communication-efficient.
arXiv Detail & Related papers (2025-02-24T09:12:29Z)
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.67982828148859]
We propose a unified training framework for deep neural networks. We introduce three instances of MARS that leverage preconditioned gradient optimization. Results indicate that MARS consistently outperforms Adam.
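As a rough illustration of the variance-reduction idea, the sketch below corrects the current gradient with a scaled difference from the previous gradient before it enters the momentum; the exact correction form and the constants are assumptions, not the paper's three instantiations.

```python
import numpy as np

def mars_momentum(m, g, g_prev, beta1=0.95, gamma=0.025):
    # Variance-reduced gradient: correct the current gradient with a scaled
    # difference from the previous one (form and constants are assumptions).
    c = g + gamma * (beta1 / (1 - beta1)) * (g - g_prev)
    # The corrected gradient then feeds the usual EMA momentum.
    return beta1 * m + (1 - beta1) * c
```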
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics [37.21593513802284]
We introduce LDAdam, a memory-efficient optimizer for training large models. We show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
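A hedged sketch of the low-dimensional idea: Adam's moment estimates are kept in a k-dimensional subspace, so optimizer memory scales with k rather than the full parameter count. The paper's projection-aware updates and error-feedback mechanism are omitted here.

```python
import numpy as np

def ldadam_step(W, G, m, v, P, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Project the gradient into a k-dimensional subspace (P has orthonormal columns).
    Gp = P.T @ G
    # Adam's moment estimates live in the subspace, shrinking optimizer memory.
    m = beta1 * m + (1 - beta1) * Gp
    v = beta2 * v + (1 - beta2) * Gp ** 2
    # Map the adaptive step back to the full parameter space.
    return W - lr * (P @ (m / (np.sqrt(v) + eps))), m, v
```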
arXiv Detail & Related papers (2024-10-21T15:31:06Z)
- Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM).
ADA-CM has two operating modes; the first makes it tunable without learning new parameters, in a training-free paradigm.
Our proposed method achieves results competitive with the state of the art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
- Promoting Exploration in Memory-Augmented Adam using Critical Momenta [33.62231951499847]
We propose a memory-augmented version of Adam that encourages exploration towards flatter minima by maintaining a buffer of critical momenta during training.
This buffer prompts the model to overshoot narrow minima, promoting exploration.
We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn Treebank, and online learning tasks on TinyImageNet and 5-dataset.
arXiv Detail & Related papers (2023-07-18T20:59:52Z)
- MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle in which the running mean of squared gradients in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behaviors.
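The paper derives its variance-maximizing weights in closed form; the sketch below is only a crude stand-in that picks, per coordinate, the largest of a few candidate weighted means of squared gradients.

```python
import numpy as np

def maxva_second_moment(v, g, betas=(0.5, 0.9, 0.999)):
    # Candidate weighted means of squared gradients under a few decay rates.
    candidates = np.stack([b * v + (1 - b) * g ** 2 for b in betas])
    # Per coordinate, keep the largest candidate, i.e. the most conservative
    # variance estimate (a stand-in for the closed-form weight choice).
    return candidates.max(axis=0)
```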
arXiv Detail & Related papers (2020-06-21T21:47:43Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to premature decay of effective step sizes and sub-optimal model performance.
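The summary describes the problem rather than the remedy. As an illustration of the AdamP-style fix (my paraphrase, not spelled out above), one removes the update component parallel to a scale-invariant weight, since that component only inflates the weight norm and thereby shrinks the effective step size.

```python
import numpy as np

def adamp_project(w, update, eps=1e-8):
    # Unit vector along the current (scale-invariant) weight.
    w_unit = w / (np.linalg.norm(w) + eps)
    # Remove the component of the update parallel to w: that component only
    # grows the weight norm, which shrinks the effective step size.
    return update - np.dot(w_unit, update) * w_unit
```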
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.