An Isometric Stochastic Optimizer
- URL: http://arxiv.org/abs/2307.12979v1
- Date: Mon, 24 Jul 2023 17:56:58 GMT
- Title: An Isometric Stochastic Optimizer
- Authors: Jacob Jackson
- Abstract summary: Adam is the standard choice in deep learning applications.
I propose a simple explanation of Adam's success: it makes each parameter's step size independent of the norms of the other parameters.
I derive Iso, a new approach which makes the norm of a parameter's update invariant to the application of any linear transformation to its inputs and outputs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Adam optimizer is the standard choice in deep learning applications. I
propose a simple explanation of Adam's success: it makes each parameter's step
size independent of the norms of the other parameters. Based on this principle
I derive Iso, a new optimizer which makes the norm of a parameter's update
invariant to the application of any linear transformation to its inputs and
outputs. I develop a variant of Iso called IsoAdam that allows optimal
hyperparameters to be transferred from Adam, and demonstrate that IsoAdam
obtains a speedup over Adam when training a small Transformer.
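The abstract's claim about Adam can be checked numerically. The following is a minimal sketch of the textbook Adam update only, not Iso or IsoAdam, whose update rules are not given on this page. The helper names `adam_step` and `first_step`, the parameter names `w1` and `w2`, and the hyperparameter values are illustrative assumptions. It rescales the gradient of one parameter by 1000 and compares the resulting first-step updates.

```python
import numpy as np

def adam_step(grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One textbook Adam step over a dict of named parameters (illustrative helper)."""
    updates, new_m, new_v = {}, {}, {}
    for name, g in grads.items():
        # Exponential moving averages of the gradient and its elementwise square.
        new_m[name] = b1 * m[name] + (1 - b1) * g
        new_v[name] = b2 * v[name] + (1 - b2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = new_m[name] / (1 - b1 ** t)
        v_hat = new_v[name] / (1 - b2 ** t)
        updates[name] = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return updates, new_m, new_v

def first_step(grads):
    """Run a single Adam step from zero-initialized moment estimates."""
    m = {k: np.zeros_like(g) for k, g in grads.items()}
    v = {k: np.zeros_like(g) for k, g in grads.items()}
    updates, _, _ = adam_step(grads, m, v, t=1)
    return updates

rng = np.random.default_rng(0)
grads = {"w1": rng.normal(size=(4, 3)), "w2": rng.normal(size=(3, 2))}
# Hypothetical perturbation: blow up the gradient of w2 only.
scaled = {"w1": grads["w1"], "w2": 1000.0 * grads["w2"]}

base = first_step(grads)
after = first_step(scaled)

# The update for w1 does not depend on the scale of w2's gradient.
w1_unchanged = np.allclose(base["w1"], after["w1"])
# The norm of w2's own update is also nearly unaffected by the rescaling.
w2_norm_ratio = np.linalg.norm(after["w2"]) / np.linalg.norm(base["w2"])
print(w1_unchanged, w2_norm_ratio)
```

This only illustrates the per-coordinate normalization in standard Adam. The invariance to linear transformations of a parameter's inputs and outputs that the abstract attributes to Iso is a stronger property and is not demonstrated here.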
Related papers
- AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training [22.58304858379219]
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance.
arXiv Detail & Related papers (2025-05-22T08:16:48Z) - Towards Simple and Provable Parameter-Free Adaptive Gradient Methods [56.060918447252625]
We present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees.
We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.
arXiv Detail & Related papers (2024-12-27T04:22:02Z) - CAdam: Confidence-Based Optimization for Online Learning [35.84013976735154]
We introduce CAdam, a confidence-based optimization strategy that assesses the consistency between the momentum and the gradient for each parameter dimension before deciding on updates.
Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known systems.
In large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam.
arXiv Detail & Related papers (2024-11-29T12:00:27Z) - Continuous-Time Analysis of Adaptive Optimization and Normalization [5.954511401622424]
Adaptive optimization algorithms, particularly Adam and its variant AdamW, are fundamental components of modern deep learning.
This work presents a continuous-time formulation of Adam and AdamW, facilitating a tractable analysis of training dynamics.
arXiv Detail & Related papers (2024-11-08T18:07:55Z) - LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics [37.21593513802284]
We introduce LDAdam, a memory-efficient optimizer for training large models.
We show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models.
arXiv Detail & Related papers (2024-10-21T15:31:06Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - Provable Adaptivity of Adam under Non-uniform Smoothness [79.25087082434975]
Adam is widely adopted in practical applications due to its fast convergence.
Existing convergence analyses for Adam rely on the bounded smoothness assumption.
This paper studies the convergence of randomly reshuffled Adam with diminishing learning rate.
arXiv Detail & Related papers (2022-08-21T14:57:47Z) - A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is so simple and generic that it can be leveraged to establish convergence for a broader family of non-convex optimization problems, including compositional ones.
arXiv Detail & Related papers (2021-12-07T02:47:58Z) - Adam$^+$: A Stochastic Method with Adaptive Variance Reduction [56.051001950733315]
Adam is a widely used optimization method for deep learning applications.
We propose a new method named Adam$^+$ (pronounced as Adam-plus).
Our empirical studies on various deep learning tasks, including image classification, language modeling, and automatic speech recognition, demonstrate that Adam$^+$ significantly outperforms Adam.
arXiv Detail & Related papers (2020-11-24T09:28:53Z) - EAdam Optimizer: How $\epsilon$ Impact Adam [7.0552555621312605]
We discuss the impact of the constant $\epsilon$ for Adam in this paper.
Based on this finding, we propose a new variant of Adam called EAdam.
Our method can bring significant improvement compared with Adam.
arXiv Detail & Related papers (2020-11-04T06:39:44Z) - MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients [112.00379151834242]
We propose an adaptive learning rate principle, in which the running mean of the squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate.
This results in faster adaptation, which leads to more desirable empirical convergence behavior.
arXiv Detail & Related papers (2020-06-21T21:47:43Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence covering both the Adam and Adagrad algorithms.
With the right hyperparameters, Adam converges at the same rate, $O(d\ln(N)/\sqrt{N})$; with the default parameters it does not converge.
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.