Dual Gauss-Newton Directions for Deep Learning
- URL: http://arxiv.org/abs/2308.08886v2
- Date: Fri, 27 Oct 2023 00:02:18 GMT
- Title: Dual Gauss-Newton Directions for Deep Learning
- Authors: Vincent Roulet, Mathieu Blondel
- Abstract summary: Inspired by Gauss-Newton-like methods, we study the benefit of leveraging the structure of deep learning objectives.
We propose to compute such direction oracles via their dual formulation, leading to both computational benefits and new insights.
- Score: 16.77273032202006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by Gauss-Newton-like methods, we study the benefit of leveraging the
structure of deep learning objectives, namely, the composition of a convex loss
function and of a nonlinear network, in order to derive better direction
oracles than stochastic gradients, based on the idea of partial linearization.
In a departure from previous works, we propose to compute such direction
oracles via their dual formulation, leading to both computational benefits and
new insights. We demonstrate that the resulting oracles define descent
directions that can be used as a drop-in replacement for stochastic gradients
in existing optimization algorithms. We empirically study the advantage of
using the dual formulation as well as the computational trade-offs involved in
the computation of such oracles.
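For the squared loss, the dual formulation takes a concrete form that is easy to sketch: partially linearizing the network net around the current weights w in min_d l(net(w) + J d) + ||d||^2 / (2*eta) and dualizing yields the output-space linear system (I + eta * J J^T) alpha = net(w) - y, with primal direction d = -eta * J^T alpha. The following minimal JAX sketch illustrates this special case only; it is not the authors' implementation, general convex losses involve the loss conjugate as described in the paper, and the name `gn_direction_dual` is illustrative.

```python
import jax
import jax.numpy as jnp

def gn_direction_dual(net, w, y, eta=1.0, cg_iters=20):
    """Gauss-Newton direction for 0.5 * ||net(w) - y||^2, computed via the dual."""
    out, jvp = jax.linearize(net, w)   # out = net(w), jvp(v) = J @ v
    _, vjp = jax.vjp(net, w)           # vjp(a)[0] = J.T @ a
    def matvec(alpha):                 # alpha -> (I + eta * J @ J.T) @ alpha
        return alpha + eta * jvp(vjp(alpha)[0])
    # solve the dual system in output space with conjugate gradient
    alpha, _ = jax.scipy.sparse.linalg.cg(matvec, out - y, maxiter=cg_iters)
    return -eta * vjp(alpha)[0]        # primal direction d = -eta * J.T @ alpha

# toy usage: a tiny nonlinear model net(w) = tanh(X @ w)
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (32, 5))
y = jnp.zeros(32)
w = jnp.ones(5)
d = gn_direction_dual(lambda w: jnp.tanh(X @ w), w, y)
w = w + d  # use d as a drop-in replacement for a stochastic-gradient step
```

Because the linear system lives in output space (the dimension of net(w)) rather than parameter space, only Jacobian-vector and vector-Jacobian products with the network are needed, which is the kind of computational benefit the abstract alludes to.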
Related papers
- Gradient Descent as a Perceptron Algorithm: Understanding Dynamics and Implicit Acceleration [67.12978375116599]
We show that the steps of gradient descent (GD) reduce to those of generalized perceptron algorithms. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks.
arXiv Detail & Related papers (2025-12-12T14:16:35Z)
- Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives [0.0]
We show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight both into the geometry of nonlinear discretizations and into the distribution of stationary points in the loss landscape.
arXiv Detail & Related papers (2025-10-13T22:29:52Z)
- Spectral-factorized Positive-definite Curvature Learning for NN Training [39.296923519945814]
Training methods such as Adam(W) and Shampoo learn a positive-definite curvature matrix and precondition gradients with its inverse root.
We propose a Riemannian optimization approach that dynamically adapts spectral-factorized positive-definite curvature estimates.
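As a point of reference for the pattern this snippet describes, here is a generic eigendecomposition-based inverse-root preconditioner: maintain a positive-definite curvature estimate C by an exponential moving average of gradient outer products and precondition with C^{-1/2}. This is a full-matrix Adam-like baseline, not the paper's Riemannian spectral-factorized update, and the function name is illustrative.

```python
import jax.numpy as jnp

def inverse_root_precondition(grad, C, beta=0.99, damping=1e-8):
    """One step of generic inverse-root preconditioning with a curvature EMA."""
    C = beta * C + (1.0 - beta) * jnp.outer(grad, grad)  # curvature estimate
    eigvals, eigvecs = jnp.linalg.eigh(C)                # C = Q diag(l) Q^T
    inv_root = eigvecs @ jnp.diag(1.0 / jnp.sqrt(eigvals + damping)) @ eigvecs.T
    return inv_root @ grad, C                            # preconditioned step, new state

# usage sketch: C = 1e-3 * jnp.eye(dim); per step:
#   update, C = inverse_root_precondition(grad, C); w = w - lr * update
```

The repeated eigendecomposition is exactly the expensive matrix-root computation that motivates learning the spectral factorization dynamically, as the paper proposes.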
arXiv Detail & Related papers (2025-02-10T09:07:04Z) - Debiasing Mini-Batch Quadratics for Applications in Deep Learning [22.90473935350847]
Quadratic approximations form a fundamental building block of machine learning methods.
When computations on the entire training set are intractable - typical for deep learning - the relevant quantities are computed on mini-batches.
We (i) show that mini-batch quadratics are systematically biased, (ii) provide a theoretical explanation for this bias, (iii) explain its relevance for second-order optimization and for uncertainty quantification via the Laplace approximation in deep learning, and (iv) develop and evaluate debiasing strategies.
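A toy illustration of the bias (not the paper's experiments or its debiasing strategy): on a least-squares problem, a quadratic model built from one mini-batch, evaluated at the step that minimizes it, typically promises a larger decrease than the full-batch objective actually delivers, because the step overfits the sampled batch.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
A = jax.random.normal(k1, (1000, 20))
b = jax.random.normal(k2, (1000,))
w = jnp.zeros(20)
full_loss = lambda w: 0.5 * jnp.mean((A @ w - b) ** 2)

idx = jax.random.choice(k3, 1000, shape=(32,), replace=False)  # one mini-batch
Ab, bb = A[idx], b[idx]
g = Ab.T @ (Ab @ w - bb) / 32                 # mini-batch gradient
H = Ab.T @ Ab / 32 + 1e-6 * jnp.eye(20)       # mini-batch curvature
step = -jnp.linalg.solve(H, g)                # minimizer of the batch quadratic
predicted = -0.5 * step @ g                   # decrease promised by the batch model
actual = full_loss(w) - full_loss(w + step)   # decrease actually realized
print(predicted, actual)  # predicted typically exceeds actual: the batch model is over-optimistic
```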
arXiv Detail & Related papers (2024-10-18T09:37:05Z) - Adaptive Quantum Generative Training using an Unbounded Loss Function [1.0485739694839669]
We propose a generative quantum learning algorithm, Rényi-ADAPT, using the Adaptive Derivative-Assembled Problem Tailored ansatz framework.
We benchmark this method against other state-of-the-art adaptive algorithms by learning random two-local thermal states.
We show that Rényi-ADAPT is capable of constructing shallow quantum circuits competitive with existing methods, while retaining favorable gradients thanks to the maximal Rényi divergence loss function.
arXiv Detail & Related papers (2024-08-01T01:04:53Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner, and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
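A minimal caricature of the dual view (not the authors' exact algorithm, which adds momentum, iterate averaging, and carefully derived step sizes): run stochastic gradient descent on the dual of kernel ridge regression, min_a 0.5 * a^T (K + lam*I) a - a^T y, touching only a random block of coordinates per step. The function name is illustrative.

```python
import jax
import jax.numpy as jnp

def stochastic_dual_descent(K, y, lam=1e-2, lr=1.0, steps=2000, batch=64, seed=0):
    """SGD on the dual of kernel ridge regression, one coordinate block per step."""
    n = y.shape[0]
    alpha = jnp.zeros(n)
    key = jax.random.PRNGKey(seed)
    for _ in range(steps):
        key, sub = jax.random.split(key)
        idx = jax.random.choice(sub, n, shape=(batch,), replace=False)
        grad_b = K[idx] @ alpha + lam * alpha[idx] - y[idx]   # block of the dual gradient
        # conservative step size; the paper derives much better choices
        alpha = alpha.at[idx].add(-(lr / batch) * grad_b)
    return alpha  # predictive mean at test inputs X*: K(X*, X) @ alpha
```

Working on the dual variables alpha keeps every update an O(batch * n) kernel-matrix slice product, with no matrix inversion anywhere.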
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
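A minimal sketch of the interpolation scheme described above (the helper name `interpolated_train` is illustrative, and the inner optimizer is plain gradient descent purely for demonstration): instead of accepting the inner optimizer's update directly, move only part of the way toward it, theta <- (1 - lam) * theta + lam * proposal. For nonexpansive update maps, this Krasnoselskii-Mann-style averaging is what underpins the stability guarantees.

```python
import jax
import jax.numpy as jnp

def interpolated_train(loss, theta, lr=0.1, lam=0.5, inner=5, outer=100):
    """Lookahead-style training: interpolate toward the inner optimizer's iterate."""
    grad = jax.grad(loss)
    for _ in range(outer):
        proposal = theta
        for _ in range(inner):                         # inner optimizer: a few GD steps
            proposal = proposal - lr * grad(proposal)
        theta = (1.0 - lam) * theta + lam * proposal   # linear interpolation toward it
    return theta
```

Note that with a single inner gradient step this reduces to a smaller learning rate; the interesting regime is an inner loop of several steps or a minimax optimizer, where the interpolation damps oscillations the inner method alone would amplify.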
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - A Recursively Recurrent Neural Network (R2N2) Architecture for Learning
Iterative Algorithms [64.3064050603721]
We generalize the Runge-Kutta neural network to a recursively recurrent neural network (R2N2) superstructure for the design of customized iterative algorithms.
We demonstrate that regular training of the weight parameters inside the proposed superstructure on input/output data of various computational problem classes yields similar iterations to Krylov solvers for linear equation systems, Newton-Krylov solvers for nonlinear equation systems, and Runge-Kutta solvers for ordinary differential equations.
arXiv Detail & Related papers (2022-11-22T16:30:33Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefits of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
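This spectral effect is easy to verify numerically (a toy version of the paper's setting, not its proofs): gradient descent on a quadratic 0.5 * w^T H w - b^T w, started at zero and stopped after t steps, applies the filter (1 - (1 - eta*lambda)^t) / lambda to each eigenvalue lambda of H, so the learning rate eta controls which eigendirections are fit by a given stopping time.

```python
import jax.numpy as jnp

eigs = jnp.array([1.0, 0.1, 0.01])   # spectrum of H (diagonal for simplicity)
b = jnp.ones(3)
t = 10

def gd_iterate(eta, t):
    w = jnp.zeros(3)
    for _ in range(t):
        w = w - eta * (eigs * w - b)  # gradient of the quadratic, H diagonal
    return w

for eta in (0.1, 1.0):
    w = gd_iterate(eta, t)
    filt = (1.0 - (1.0 - eta * eigs) ** t) / eigs  # closed-form spectral filter
    print(eta, w, filt * b)  # the iterate matches the filtered target
```

With eta = 1.0, the small eigendirections are already partially fit after 10 steps; with eta = 0.1, only the leading eigendirection is, illustrating how the learning rate shapes the spectral decomposition of the early-stopped solution.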
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
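To make the bottleneck concrete, here is a minimal sketch of what it costs (this shows the problem SHINE targets, not SHINE itself): for an implicit layer defined by a fixed point z* = f(z*, x), backpropagation needs the adjoint v solving (I - df/dz)^T v = g, usually via an inner iterative solver such as the Neumann iteration below. SHINE instead reuses the quasi-Newton inverse approximation already built while solving the forward fixed point.

```python
import jax

def implicit_layer_grad(f, z_star, x, g, iters=50):
    """Gradient of an implicit layer via the implicit function theorem."""
    # adjoint solve: v = g + (df/dz)^T v, i.e. v = (I - df/dz)^{-T} g
    _, vjp_z = jax.vjp(lambda z: f(z, x), z_star)
    v = g
    for _ in range(iters):   # Neumann iteration; converges when f contracts in z
        v = g + vjp_z(v)[0]
    _, vjp_x = jax.vjp(lambda x: f(z_star, x), x)
    return vjp_x(v)[0]       # dL/dx = (df/dx)^T v
```

The inner loop amounts to dozens of vector-Jacobian products per backward pass, which is precisely the iterative-inversion cost the snippet refers to.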
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Channel-Directed Gradients for Optimization of Convolutional Neural
Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
We apply the algorithm to problems in which one variable is subject to a sparsity constraint.
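For context, here is a plain alternating-gradient baseline (not CoGD's coupled update, whose exact form is given in the paper) for a bilinear problem of the kind the snippet describes: sparse coding, min_{A,x} 0.5 * ||y - A x||^2 + mu * ||x||_1, with the l1 term handled by soft-thresholding on the sparse variable.

```python
import jax
import jax.numpy as jnp

def soft_threshold(x, t):
    return jnp.sign(x) * jnp.maximum(jnp.abs(x) - t, 0.0)

def alternating_descent(y, A, x, lr=0.01, mu=0.1, steps=500):
    """Alternating (proximal) gradient descent on a bilinear sparse-coding objective."""
    def data_fit(A, x):
        return 0.5 * jnp.sum((y - A @ x) ** 2)
    for _ in range(steps):
        gA, gx = jax.grad(data_fit, argnums=(0, 1))(A, x)
        A = A - lr * gA                           # gradient step on the dictionary
        x = soft_threshold(x - lr * gx, lr * mu)  # proximal gradient step on the code
    return A, x
```

Treating the two variables independently like this ignores their coupling; exploiting that coupling in a synchronized update is the refinement CoGD proposes.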
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.