Related papers: Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization

Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization

URL: http://arxiv.org/abs/2502.12298v1
Date: Mon, 17 Feb 2025 20:20:11 GMT
Title: Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization
Authors: Aditya Ranganath, Mukesh Singhal, Roummel Marcia,
Abstract summary: First-order descent and other first-order variants, such as Adam and AdaGrad, are commonly used in the field of deep learning.<n>However, these methods do not exploit curvature information.<n>Quasi-Newton methods re-use previously computed low Hessian approximations.
Score: 0.5120567378386615
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Stochastic gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in the field of deep learning due to their computational efficiency and low-storage memory requirements. However, these methods do not exploit curvature information. Consequently, iterates can converge to saddle points or poor local minima. On the other hand, Quasi-Newton methods compute Hessian approximations which exploit this information with a comparable computational budget. Quasi-Newton methods re-use previously computed iterates and gradients to compute a low-rank structured update. The most widely used quasi-Newton update is the L-BFGS, which guarantees a positive semi-definite Hessian approximation, making it suitable in a line search setting. However, the loss functions in DNNs are non-convex, where the Hessian is potentially non-positive definite. In this paper, we propose using a limited-memory symmetric rank-one quasi-Newton approach which allows for indefinite Hessian approximations, enabling directions of negative curvature to be exploited. Furthermore, we use a modified adaptive regularized cubics approach, which generates a sequence of cubic subproblems that have closed-form solutions with suitable regularization choices. We investigate the performance of our proposed method on autoencoders and feed-forward neural network models and compare our approach to state-of-the-art first-order adaptive stochastic methods as well as other quasi-Newton methods.x

Related papers

Fast Unconstrained Optimization via Hessian Averaging and Adaptive Gradient Sampling Methods [0.3222802562733786]
We consider minimizing finite-sum expectation objective functions via Hessian-averaging based subsampled Newton methods. These methods allow for inexactness and have fixed per-it Hessian approximation costs. We present novel analysis techniques and propose challenges for their practical implementation.
arXiv Detail & Related papers (2024-08-14T03:27:48Z)
An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes [17.804065824245402]
In machine learning applications, each loss function is non-negative and can be expressed as the composition of a square and its real-valued square root. We show how to apply the Gauss-Newton method or the Levssquardt method to minimize the average of smooth but possibly non-negative functions.
arXiv Detail & Related papers (2024-07-05T08:53:06Z)
Incremental Quasi-Newton Methods with Faster Superlinear Convergence Rates [50.36933471975506]
We consider the finite-sum optimization problem, where each component function is strongly convex and has Lipschitz continuous gradient and Hessian. The recently proposed incremental quasi-Newton method is based on BFGS update and achieves a local superlinear convergence rate. This paper proposes a more efficient quasi-Newton method by incorporating the symmetric rank-1 update into the incremental framework.
arXiv Detail & Related papers (2024-02-04T05:54:51Z)
Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective. We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices. Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
Variance reduction techniques for stochastic proximal point algorithms [5.374800961359305]
In this work, we propose the first unified study of variance reduction techniques for proximal point algorithms. We introduce a generic proximal-based algorithm that can be specified to give the proximal version of SVRG, SAGA, and some of their variants. Our experiments demonstrate the advantages of the proximal variance reduction methods over their gradient counterparts.
arXiv Detail & Related papers (2023-08-18T05:11:50Z)
Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods [37.1630298053787]
We propose a new framework, which we call the helper framework. It provides a unified view of the variance and second-order algorithms equipped with global complexity guarantees.
arXiv Detail & Related papers (2023-02-23T12:18:28Z)
Robust empirical risk minimization via Newton's method [9.797319790710711]
A new variant of Newton's method for empirical risk minimization is studied. The gradient and Hessian of the objective function are replaced by robust estimators. An algorithm for obtaining robust Newton directions based on the conjugate gradient method is also proposed.
arXiv Detail & Related papers (2023-01-30T18:54:54Z)
SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates [19.420605210427635]
SketchySGD improves upon existing gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. In the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems.
arXiv Detail & Related papers (2022-11-16T01:05:41Z)
Stochastic Gradient Methods with Preconditioned Updates [47.23741709751474]
There are several algorithms for such problems, but existing methods often work poorly when badly scaled and/or ill-conditioned. Here we include preconditionimater based on Hutchinson's approach to approxing the diagonal Hessian. We prove convergence both when smoothness and the PL condition are assumed.
arXiv Detail & Related papers (2022-06-01T07:38:08Z)
Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming [53.63469275932989]
We consider online statistical inference of constrained nonlinear optimization problems.<n>We apply the Sequential Quadratic Programming (StoSQP) method to solve these problems.
arXiv Detail & Related papers (2022-05-27T00:34:03Z)
Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence [69.65563161962245]
We consider a smooth and strongly convex objective function using a Newton method. We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage.
arXiv Detail & Related papers (2022-04-20T07:14:21Z)
Nys-Curve: Nystr\"om-Approximated Curvature for Stochastic Optimization [20.189732632410024]
The quasi-Newton method provides curvature information by approximating the Hessian using the secant equation. We propose an approximate Newton step-based DP optimization algorithm for large-scale empirical risk of convex functions with linear convergence rates.
arXiv Detail & Related papers (2021-10-16T14:04:51Z)
Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update [88.73437209862891]
In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration. We show that the Gaussian sketching matrix can be drastically sparsified, significantly reducing the computational cost of sketching. We prove that Newton-LESS enjoys nearly the same problem-independent local convergence rate as Gaussian embeddings.
arXiv Detail & Related papers (2021-07-15T17:33:05Z)
Disentangling the Gauss-Newton Method and Approximate Inference for Neural Networks [96.87076679064499]
We disentangle the generalized Gauss-Newton and approximate inference for Bayesian deep learning. We find that the Gauss-Newton method simplifies the underlying probabilistic model significantly. The connection to Gaussian processes enables new function-space inference algorithms.
arXiv Detail & Related papers (2020-07-21T17:42:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.