Related papers: Old Optimizer, New Norm: An Anthology

Old Optimizer, New Norm: An Anthology

URL: http://arxiv.org/abs/2409.20325v1
Date: Mon, 30 Sep 2024 14:26:12 GMT
Title: Old Optimizer, New Norm: An Anthology
Authors: Jeremy Bernstein, Laker Newhouse,
Abstract summary: We argue that each method can instead be understood as a squarely first-order method without convexity assumptions. By generalizing this observation, we chart a new design space for training algorithms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
Score: 3.471637998699967
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.

Related papers

The Normalized Difference Layer: A Differentiable Spectral Index Formulation for Deep Learning [0.5131152350448098]
We introduce the Normalized Difference Layer that is a differentiable neural network module.<n>We present a complete mathematical framework for integrating this layer into deep learning architectures.<n> Experiments show that models using this layer reach similar classification accuracy to standard multilayer perceptrons.
arXiv Detail & Related papers (2026-01-11T05:03:01Z)
Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit [66.20349460098275]
We study the gradient descent learning of a general Gaussian Multi-index model $f(boldsymbolx)=g(boldsymbolUboldsymbolx)$ with hidden subspace $boldsymbolUin mathbbRrtimes d$.<n>We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error.
arXiv Detail & Related papers (2025-11-19T04:46:47Z)
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem.<n>We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
BALI: Learning Neural Networks via Bayesian Layerwise Inference [6.7819070167076045]
We introduce a new method for learning Bayesian neural networks, treating them as a stack of multivariate Bayesian linear regression models. The main idea is to infer the layerwise posterior exactly if we know the target outputs of each layer. We define these pseudo-targets as the layer outputs from the forward pass, updated by the backpropagated of the objective function.
arXiv Detail & Related papers (2024-11-18T22:18:34Z)
Modular Duality in Deep Learning [3.471637998699967]
We construct a duality map for general neural networks. Our map forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Our iteration was recently used to set new speed records for training NanoGPT.
arXiv Detail & Related papers (2024-10-28T17:57:31Z)
Sparser, Better, Deeper, Stronger: Improving Sparse Training with Exact Orthogonal Initialization [49.06421851486415]
Static sparse training aims to train sparse models from scratch, achieving remarkable results in recent years. We propose Exact Orthogonal Initialization (EOI), a novel sparse Orthogonal Initialization scheme based on random Givens rotations. Our method enables training highly sparse 1000-layer and CNN networks without residual connections or normalization techniques.
arXiv Detail & Related papers (2024-06-03T19:44:47Z)
Adapting Newton's Method to Neural Networks through a Summary of Higher-Order Derivatives [0.0]
We consider a gradient-based optimization method applied to a function $boldsymboltheta$. This framework encompasses many common use-cases, such as training neural networks by gradient descent.
arXiv Detail & Related papers (2023-12-06T20:24:05Z)
Efficient Finite Initialization for Tensorized Neural Networks [41.94295877935867]
We present a novel method for initializing layers of tensorized neural networks in a way that avoids the explosion of the parameters of the matrix it emulates. We create a Python function to run it on an arbitrary layer, available in a Jupyter Notebook in the i3BQuantum repository.
arXiv Detail & Related papers (2023-09-11T08:05:09Z)
Online Learning for the Random Feature Model in the Student-Teacher Framework [0.0]
We study over-parametrization in the context of a student-teacher framework. For any finite ratio of hidden layer size and input dimension, the student cannot generalize perfectly. Only when the student's hidden layer size is exponentially larger than the input dimension, an approach to perfect generalization is possible.
arXiv Detail & Related papers (2023-03-24T15:49:02Z)
A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks. We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition. Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalization of convex potential layers.
arXiv Detail & Related papers (2023-03-06T14:31:09Z)
Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay globalization. We show that if convex, and the weight decay regularization is employed, any optimization algorithms including Adam will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to emphNon-uniform Smoothness (NS) and emphNon-uniform Lojasiewicz inequality (NL) New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $Omega (1/t2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z)
Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data. Our results show that gradient descent on over-parametrized objective could go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z)
Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective. We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.