Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants
- URL: http://arxiv.org/abs/2502.02431v1
- Date: Tue, 04 Feb 2025 15:55:35 GMT
- Title: Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants
- Authors: Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade
- Abstract summary: We show that AdEMAMix most closely resembles accelerated versions of gradient descent. We introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance across both large and small batch-size settings.
- Score: 5.08749017242817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150M-parameter language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available at: https://github.com/DepenM/Simplified-AdEMAMix/.
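To make the "mixture of two EMAs" idea concrete, below is a minimal NumPy sketch of an AdEMAMix-style Adam step: a fast EMA m1 (Adam's usual first moment) and a slow EMA m2 (beta3 close to 1) are combined in the numerator, with the slow term scaled by alpha. The function name, the default hyperparameter values, and the omission of weight decay and of the paper's warm-up schedules for alpha and beta3 are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

```python
# Illustrative NumPy sketch of an AdEMAMix-style update (not the reference code).
# Adam's first moment is kept as a fast EMA (beta1), and a second, slow EMA of
# gradients (beta3 close to 1) is added to the update direction, scaled by alpha.
# Weight decay and the alpha/beta3 warm-up schedules are omitted for brevity.
import numpy as np

def ademamix_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8):
    """One update for a single parameter tensor; `state` holds the buffers."""
    state["t"] += 1
    t = state["t"]
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad      # fast EMA of gradients
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad      # slow EMA of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2   # EMA of squared gradients
    m1_hat = state["m1"] / (1 - beta1 ** t)                     # bias correction (fast EMA)
    v_hat = state["v"] / (1 - beta2 ** t)                       # bias correction (2nd moment)
    update = (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)
    return param - lr * update, state

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.ones(3)
state = {"t": 0, "m1": np.zeros(3), "m2": np.zeros(3), "v": np.zeros(3)}
for _ in range(200):
    w, state = ademamix_step(w, w, state)
```

Per the abstract, Simplified-AdEMAMix removes the need for two separate momentum buffers; the sketch above corresponds to the original two-EMA formulation.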
Related papers
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
- Exponential Moving Average of Weights in Deep Learning: Dynamics and Benefits [11.801688624472009]
We present a systematic study of the Exponential Moving Average (EMA) of weights.
We show that EMA solutions differ from last-iterate solutions.
We suggest that an EMA of weights is a simple yet effective plug-in to improve the performance of deep learning models.
arXiv Detail & Related papers (2024-11-27T19:14:27Z)
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
We propose a unified training framework for deep neural networks.
We introduce three instances of MARS that leverage preconditioned gradient optimization.
Results indicate that MARS consistently outperforms Adam.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- The AdEMAMix Optimizer: Better, Faster, Older [24.470432924661324]
This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal.
We propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients.
Our experiments on language modeling and image classification show -- quite surprisingly -- that gradients can stay relevant for tens of thousands of steps.
arXiv Detail & Related papers (2024-09-05T00:13:16Z)
- Fast Semisupervised Unmixing Using Nonconvex Optimization [80.11512905623417]
We introduce a novel nonconvex model for semisupervised/library-based unmixing.
We demonstrate the efficacy of alternating optimization methods for sparse unmixing.
arXiv Detail & Related papers (2024-01-23T10:07:41Z)
- PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
- Sparse Backpropagation for MoE Training [118.31785160874024]
We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing.
Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations.
Applied to Switch Transformer on both pre-training and machine translation tasks, SparseMixer shows considerable performance gains.
arXiv Detail & Related papers (2023-10-01T22:43:57Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Learned Gradient of a Regularizer for Plug-and-Play Gradient Descent [37.41458921829744]
The Plug-and-Play framework allows integrating advanced image denoising priors into optimization algorithms.
Plug-and-Play and Regularization by Denoising (RED) algorithms are two examples of methods that made a breakthrough in image restoration.
We show that it is possible to train a denoiser along with a network that corresponds to the gradient of its regularizer.
arXiv Detail & Related papers (2022-04-29T08:33:33Z)
- A Mixture of Expert Based Deep Neural Network for Improved ASR [4.993304210475779]
MixNet is a novel deep learning architecture for acoustic modeling in the context of Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class misclassification.
Experiments conducted on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate.
arXiv Detail & Related papers (2021-12-02T07:26:34Z)