On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
- URL: http://arxiv.org/abs/2502.01763v1
- Date: Mon, 03 Feb 2025 19:08:32 GMT
- Title: On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
- Authors: Thomas T. Zhang, Behrad Moniri, Ansh Nagwekar, Faraz Rahman, Anton Xue, Hamed Hassani, Nikolai Matni
- Abstract summary: We show that layer-wise preconditioning methods are provably necessary from a statistical perspective.
We show that SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs.
We show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues.
- Score: 22.486361028522374
- License:
- Abstract: Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
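For intuition, layer-wise preconditioning applies a separate preconditioner to each axis of a layer's weight matrix, rather than a single entry-wise scaling as in Adam(W). Below is a minimal, illustrative sketch of a Shampoo-style two-sided update for one linear layer; the function names, accumulators, and constants are placeholders and this is not the exact algorithm analyzed in the paper.

```python
import numpy as np

def layerwise_preconditioned_step(W, grad, L_acc, R_acc, lr=1e-2, eps=1e-6):
    """Illustrative Shampoo-style update for one weight matrix W (m x n).

    L_acc (m x m) and R_acc (n x n) accumulate gradient second-moment
    statistics along the output and input axes; the gradient is whitened
    on both sides before the step. A sketch only, not the paper's method.
    """
    L_acc = L_acc + grad @ grad.T      # statistics over the output (row) axis
    R_acc = R_acc + grad.T @ grad      # statistics over the input (column) axis

    def inv_fourth_root(M):
        # Symmetric M^{-1/4} via an eigendecomposition, floored for stability.
        vals, vecs = np.linalg.eigh(M)
        return (vecs * np.maximum(vals, eps) ** -0.25) @ vecs.T

    # Per-axis preconditioners applied to the left and right of the gradient.
    W_new = W - lr * inv_fourth_root(L_acc) @ grad @ inv_fourth_root(R_acc)
    return W_new, L_acc, R_acc
```

Per-axis whitening of this kind is the sort of correction that entry-wise (diagonal) preconditioners cannot fully express, which is the gap the abstract highlights.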
Related papers
- Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball.
We propose a new family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems.
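As a rough illustration of the ingredient this paper builds on, an LMO over a norm-ball returns the point in the ball most aligned with the negative gradient. The sketch below shows the simplest $\ell_2$-ball case and a toy update driven by it; the names and the update rule are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def lmo_l2_ball(grad, radius=1.0):
    """Linear minimization oracle over an l2 ball:
    argmin_{||d||_2 <= radius} <grad, d> = -radius * grad / ||grad||_2."""
    norm = np.linalg.norm(grad)
    return np.zeros_like(grad) if norm == 0.0 else -(radius / norm) * grad

def lmo_driven_step(params, grad, radius=1.0, lr=0.1):
    """Toy unconstrained update that moves toward the LMO output; a sketch of
    how an LMO can supply a geometry-aware step, not the paper's method."""
    return params + lr * lmo_l2_ball(grad, radius)
```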
arXiv Detail & Related papers (2025-02-11T13:10:34Z) - The Convex Landscape of Neural Networks: Characterizing Global Optima
and Stationary Points via Lasso Models [75.33431791218302]
Deep Neural Network (DNN) models are trained by optimizing highly non-convex objectives.
In this paper we examine the use of convex neural recovery models.
We show that all stationary points of the non-convex training objective can be characterized as global optima of a subsampled convex (Lasso-type) program.
arXiv Detail & Related papers (2023-12-19T23:04:56Z) - Stochastic Gradient Descent with Preconditioned Polyak Step-size [1.3300175008796402]
Stochastic Gradient Descent with Polyak Step-size (SPS) is a method that offers an update rule alleviating the need to fine-tune the learning rate for each dataset.
In this paper, we propose an extension of SPS that employs preconditioning techniques, such as Hutchinson's method and Ada-family diagonal preconditioners, to improve its performance on badly scaled and/or ill-conditioned datasets.
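For concreteness, the plain Polyak step-size is $(f(x) - f^*)/\|\nabla f(x)\|^2$; a preconditioned variant replaces the Euclidean norm with the norm induced by a preconditioner $D$. The sketch below shows one possible form under that assumption (diagonal $D$ given as input); the exact variant in the cited paper, including how $D$ is estimated, may differ.

```python
import numpy as np

def preconditioned_polyak_step(x, f_x, f_star, grad, diag_precond):
    """One preconditioned Polyak-step-size update (illustrative sketch).

    Plain SPS: step = (f(x) - f*) / ||grad||^2.
    With a diagonal preconditioner D, a natural analogue is
        step = (f(x) - f*) / (grad^T D^{-1} grad),  x <- x - step * D^{-1} grad.
    """
    precond_grad = grad / diag_precond           # D^{-1} grad for diagonal D
    denom = float(grad @ precond_grad) + 1e-12   # grad^T D^{-1} grad
    step = max(f_x - f_star, 0.0) / denom
    return x - step * precond_grad
```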
arXiv Detail & Related papers (2023-10-03T14:36:05Z) - PROMISE: Preconditioned Stochastic Optimization Methods by Incorporating Scalable Curvature Estimates [17.777466668123886]
We introduce PROMISE ($\textbf{Pr}$econditioned $\textbf{O}$ptimization $\textbf{M}$ethods by $\textbf{I}$ncorporating $\textbf{S}$calable Curvature $\textbf{E}$stimates), a suite of sketching-based preconditioned gradient algorithms.
PROMISE includes preconditioned versions of SVRG, SAGA, and Katyusha.
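The core idea of sketching-based preconditioning is to build a cheap low-rank curvature estimate from a few Hessian-vector products and use it to rescale stochastic gradients. The Nyström-style sketch below is an illustrative stand-in for that idea; the actual PROMISE preconditioners and their integration with SVRG/SAGA/Katyusha are more involved.

```python
import numpy as np

def nystrom_preconditioner(hess_vec_prod, dim, rank=10, reg=1e-3, seed=0):
    """Build a rank-`rank` Nystrom curvature estimate H_hat of a PSD matrix H
    from Hessian-vector products, and return g -> (H_hat + reg I)^{-1} g.
    Illustrative only; not the exact PROMISE construction."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((dim, rank))                   # random test matrix
    Y = np.column_stack([hess_vec_prod(Omega[:, j]) for j in range(rank)])
    H_hat = Y @ np.linalg.pinv(Omega.T @ Y) @ Y.T              # Nystrom estimate
    H_reg = H_hat + reg * np.eye(dim)
    return lambda g: np.linalg.solve(H_reg, g)

# A preconditioned SGD-style step would then look like:
#   apply_inv = nystrom_preconditioner(hvp, dim)
#   x = x - lr * apply_inv(grad)
```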
arXiv Detail & Related papers (2023-09-05T07:49:10Z) - On the One-sided Convergence of Adam-type Algorithms in Non-convex
Non-concave Min-max Optimization [43.504548777955854]
We show that Adam-type algorithms converge to one-sided first-order stationary points in min-max optimization problems under the one-sided MVI condition.
We also empirically verify that this one-sided MVI condition is satisfied for standard GANs trained on standard datasets.
arXiv Detail & Related papers (2021-09-29T06:38:39Z) - Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NL).
The new definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Meta-Regularization: An Approach to Adaptive Choice of the Learning Rate
in Gradient Descent [20.47598828422897]
We propose Meta-Regularization, a novel approach for the adaptive choice of the learning rate in first-order descent methods.
Our approach modifies the objective function by adding a regularization term, and casts the parameter update and the learning-rate adaptation as a joint process.
arXiv Detail & Related papers (2021-04-12T13:13:34Z) - Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results.
Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples.
These frameworks still face the challenge of reduced generalization to unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for non-convex non-concave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.