Enhance Curvature Information by Structured Stochastic Quasi-Newton
Methods
- URL: http://arxiv.org/abs/2006.09606v2
- Date: Thu, 25 Mar 2021 07:43:23 GMT
- Title: Enhance Curvature Information by Structured Stochastic Quasi-Newton
Methods
- Authors: Minghan Yang, Dong Xu, Hongyu Chen, Zaiwen Wen and Mengyun Chen
- Abstract summary: We consider stochastic second-order methods for minimizing a finite summation of nonconvex functions.
Since the true Hessian matrix is often a combination of a cheap part and an expensive part, we propose a structured stochastic quasi-Newton method.
Our proposed method is quite competitive with state-of-the-art methods.
- Score: 26.712594117460817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider stochastic second-order methods for minimizing a
finite summation of nonconvex functions. One key challenge is to find an
ingenious but cheap scheme to incorporate local curvature information. Since
the true Hessian matrix is often a combination of a cheap part and an expensive
part, we propose a structured stochastic quasi-Newton method by using partial
Hessian information as much as possible. By further exploiting either the
low-rank structure or the Kronecker-product properties of the quasi-Newton
approximations, the computation of the quasi-Newton direction is affordable.
Global convergence to a stationary point and a local superlinear convergence rate
are established under mild assumptions. Numerical results on logistic
regression, deep autoencoder networks and deep convolutional neural networks
show that our proposed method is quite competitive with state-of-the-art
methods.
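
To make the Hessian-splitting idea concrete, here is a minimal sketch (not the authors' implementation) of one structured quasi-Newton step. It assumes the Hessian decomposes as H = C + E, with a callable `cheap_hess` supplying the cheap part C exactly, while the expensive part is replaced by a rank-one secant correction; the damping constant `lam` and all names are illustrative.

```python
import numpy as np

def structured_qn_step(x, grad, cheap_hess, s, y, lam=1e-3, lr=1.0):
    """One structured quasi-Newton step (illustrative sketch only).

    grad is the current gradient vector, s the previous step, and
    y the gradient difference over it. The true Hessian is split as
    H ~ C + E: C = cheap_hess(x) is computed exactly, while the
    expensive part is replaced by an SR1-style rank-one correction E
    chosen so that (C + E) s = y, i.e. the secant condition holds.
    """
    C = cheap_hess(x)                       # exact cheap part of the curvature
    r = y - C @ s                           # curvature the expensive part must supply
    denom = r @ s
    if abs(denom) > 1e-12 * np.linalg.norm(r) * np.linalg.norm(s):
        E = np.outer(r, r) / denom          # structured rank-one correction
    else:
        E = np.zeros_like(C)                # standard safeguard: skip the update
    B = C + E + lam * np.eye(len(x))        # damped structured approximation
    return x - lr * np.linalg.solve(B, grad)
```

In the deep-learning setting the paper targets, the dense solve above would instead exploit the low-rank or Kronecker-product structure of B, which is what keeps the quasi-Newton direction affordable.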
Related papers
- Simple Stepsize for Quasi-Newton Methods with Global Convergence Guarantees [46.16377198030688]
We extend the theoretical understanding of quasi-Newton methods by introducing a simple stepsize schedule that guarantees a global convergence rate of $O(1/k)$ for convex functions.
We also show that when the inexactness of the Hessian approximation is controlled within a prescribed relative accuracy, the method attains an accelerated convergence rate of $O(1/k^2)$, matching the best-known rates of both Nesterov's accelerated gradient method and cubically regularized Newton methods.
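
As a rough illustration of where such a schedule enters, here is a BFGS-style loop driven by a deterministic stepsize instead of a line search. The 1/(k+2) schedule is a hypothetical stand-in, not the schedule analyzed in the paper.

```python
import numpy as np

def bfgs_with_schedule(grad, x0, n_steps=100):
    """BFGS iteration with a prescribed stepsize schedule (sketch only;
    the schedule below is a placeholder, not the cited paper's)."""
    x, B = x0.copy(), np.eye(len(x0))       # B approximates the Hessian
    g = grad(x)
    for k in range(n_steps):
        d = -np.linalg.solve(B, g)          # quasi-Newton direction
        s = (1.0 / (k + 2)) * d             # hypothetical stepsize schedule
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if y @ s > 1e-12:                   # curvature check before BFGS update
            Bs = B @ s
            B += np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        x, g = x_new, g_new
    return x
```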
arXiv Detail & Related papers (2025-08-27T09:18:51Z)
- Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization [0.5120567378386615]
First-order methods such as stochastic gradient descent, along with variants such as Adam and AdaGrad, are commonly used in the field of deep learning.
However, these methods do not exploit curvature information.
Quasi-Newton methods re-use previously computed low-rank Hessian approximations.
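
For reference, the symmetric rank-one update such methods build on is only a few lines; this sketch omits the cubic-regularized subproblem the paper pairs it with.

```python
import numpy as np

def sr1_update(B, s, y, tol=1e-8):
    """Symmetric rank-one (SR1) update of a Hessian approximation B.

    Given a step s = x_new - x and gradient difference y = g_new - g,
    SR1 adds one rank-one term so that the updated matrix satisfies
    the secant condition B_new @ s = y."""
    r = y - B @ s
    denom = r @ s
    # Standard safeguard: skip the update when the denominator is tiny.
    if abs(denom) < tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom
```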
arXiv Detail & Related papers (2025-02-17T20:20:11Z)
- Online Learning Guided Quasi-Newton Methods with Global Non-Asymptotic Convergence [20.766358513158206]
We prove a global convergence rate of $O(\min\{1/k, \sqrt{d}/k^{1.25}\})$ in terms of the duality gap.
These results are the first global convergence results to demonstrate a provable advantage of a quasi-Newton method over the extragradient method.
arXiv Detail & Related papers (2024-10-03T16:08:16Z)
- Incremental Quasi-Newton Methods with Faster Superlinear Convergence Rates [50.36933471975506]
We consider the finite-sum optimization problem, where each component function is strongly convex and has Lipschitz continuous gradient and Hessian.
The recently proposed incremental quasi-Newton method is based on BFGS update and achieves a local superlinear convergence rate.
This paper proposes a more efficient quasi-Newton method by incorporating the symmetric rank-1 update into the incremental framework.
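
A minimal sketch of the incremental pattern, under assumed names: each iteration refreshes one component's SR1 Hessian approximation B_i, using a secant pair anchored at the point z_i where that component was last visited. Stepsizes, ordering and safeguards in the cited paper differ.

```python
import numpy as np

def incremental_sr1(grads, x0, n_passes=10, lr=1.0, lam=1e-3):
    """Incremental quasi-Newton sketch for f = (1/n) sum_i f_i
    (illustrative only; not the cited paper's exact method)."""
    n, d = len(grads), len(x0)
    x = x0.copy()
    B = [np.eye(d) for _ in range(n)]       # per-component Hessian approximations
    g = [grads[i](x) for i in range(n)]     # component gradients at last visit
    z = [x.copy() for _ in range(n)]        # point of each component's last visit
    for t in range(n_passes * n):
        i = t % n                           # cyclic component selection
        H = sum(B) / n + lam * np.eye(d)    # aggregated damped curvature
        x = x - lr * np.linalg.solve(H, sum(g) / n)
        s, y = x - z[i], grads[i](x) - g[i] # secant pair for component i only
        r = y - B[i] @ s
        if abs(r @ s) > 1e-8 * np.linalg.norm(r) * np.linalg.norm(s):
            B[i] = B[i] + np.outer(r, r) / (r @ s)   # SR1 refresh of B_i
        g[i], z[i] = grads[i](x), x.copy()
    return x
```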
arXiv Detail & Related papers (2024-02-04T05:54:51Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
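
In Gaussian-process regression this reduces to solving (K + noise*I) alpha = y for the representer weights; a bare-bones version of SGD on that dual objective is sketched below. The paper's actual algorithm adds momentum, iterate averaging and careful stepsize choices, none of which are reproduced here.

```python
import numpy as np

def stochastic_dual_descent(K, y, noise=1.0, lr=0.1, n_steps=1000, batch=32, seed=0):
    """SGD on the dual objective 0.5*a@(K+noise*I)@a - y@a (sketch).

    Each step samples a mini-batch of rows of the kernel matrix K and
    updates only those coordinates of alpha; the minimizer satisfies
    (K + noise*I) alpha = y."""
    rng = np.random.default_rng(seed)
    alpha = np.zeros(len(y))
    for _ in range(n_steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        g = K[idx] @ alpha + noise * alpha[idx] - y[idx]  # sampled rows of the dual gradient
        alpha[idx] -= lr * g                              # coordinate-wise update
    return alpha   # posterior mean at the training inputs is K @ alpha
```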
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods [37.1630298053787]
We propose a new framework, which we call the helper framework.
It provides a unified view of stochastic variance-reduced and second-order algorithms equipped with global complexity guarantees.
arXiv Detail & Related papers (2023-02-23T12:18:28Z)
- Linear Self-Attention Approximation via Trainable Feedforward Kernel [77.34726150561087]
In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches.
We aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
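
The core trick such methods rely on fits in a few lines: replacing softmax(Q K^T) with phi(Q) phi(K)^T for a feature map phi lets the matrix products be re-associated, dropping the cost from O(n^2 d) to O(n d^2). The fixed phi below is a placeholder for the paper's trainable feedforward kernel.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1.0):
    """Kernel-feature approximation of self-attention (sketch).

    Q, K, V: (n, d) arrays. phi is a positive feature map standing in
    for a trainable feedforward kernel; with phi applied, the small
    summary phi(K)^T V is computed once instead of an (n, n) matrix."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                           # (d, d) aggregated key-value summary
    norm = Qf @ Kf.sum(axis=0)              # (n,) row-wise normalizers
    return (Qf @ KV) / norm[:, None]        # (n, d) attention output
```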
arXiv Detail & Related papers (2022-11-08T08:14:11Z)
- Bayesian inference via sparse Hamiltonian flows [16.393322369105864]
A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during Bayesian inference.
Current methods tend to be slow, require a secondary inference step after coreset construction, and do not provide bounds on the data marginal evidence.
We introduce a new method -- sparse Hamiltonian flows -- that addresses all three of these challenges.
arXiv Detail & Related papers (2022-03-11T02:36:59Z)
- Byzantine-Resilient Non-Convex Stochastic Gradient Descent [61.6382287971982]
We study adversary-resilient distributed optimization, in which machines can independently compute gradients and cooperate to jointly optimize.
Our algorithm is based on a new concentration technique, and its sample complexity matches the best known bounds for the setting in which no Byzantine machines are present.
It is very practical: it improves upon the performance of all prior methods.
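
The overall server loop is easy to sketch; note the aggregator below is a coordinate-wise median used purely for illustration, not the concentration-based filtering technique the paper actually analyzes.

```python
import numpy as np

def byzantine_resilient_sgd(grad_fns, x0, lr=0.1, n_steps=100):
    """Distributed SGD with robust aggregation (illustrative sketch).

    grad_fns: one stochastic-gradient callable per machine, some of
    which may return arbitrary (Byzantine) vectors. The coordinate-wise
    median bounds the influence of a corrupted minority of machines."""
    x = x0.copy()
    for _ in range(n_steps):
        G = np.stack([g(x) for g in grad_fns])   # one gradient per machine
        x = x - lr * np.median(G, axis=0)        # robust aggregate, then step
    return x
```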
arXiv Detail & Related papers (2020-12-28T17:19:32Z)
- Deep Magnification-Flexible Upsampling over 3D Point Clouds [103.09504572409449]
We propose a novel end-to-end learning-based framework to generate dense point clouds.
We first formulate the problem explicitly, which boils down to determining the weights and high-order approximation errors.
Then, we design a lightweight neural network to adaptively learn unified and sorted weights as well as the high-order refinements.
arXiv Detail & Related papers (2020-11-25T14:00:18Z)
- Optimizing Mode Connectivity via Neuron Alignment [84.26606622400423]
Empirically, the local minima of loss functions can be connected by a learned curve in model space along which the loss remains nearly constant.
We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected.
arXiv Detail & Related papers (2020-09-05T02:25:23Z)
- RFN: A Random-Feature Based Newton Method for Empirical Risk Minimization in Reproducing Kernel Hilbert Spaces [14.924672048447334]
Large-scale finite-sum problems can be solved using efficient variants of Newton method, where the Hessian is approximated via sub-samples of data.
In this paper, we observe that for this class of problems, one can naturally use kernel approximation to speed up the Newton method.
We provide a novel second-order algorithm that enjoys local superlinear convergence and global linear convergence.
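
The idea can be sketched in two steps: draw random Fourier features so the kernel problem becomes a small finite-dimensional ERM, then take Newton steps on it. The feature count, Gaussian kernel and squared loss below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def rff(X, n_features=200, gamma=1.0, seed=0):
    """Random Fourier features approximating the Gaussian kernel
    k(x, z) = exp(-gamma * ||x - z||**2 / 2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def random_feature_newton(X, y, lam=1e-2, n_features=200, n_steps=5):
    """Newton's method on the random-feature ridge objective (sketch).

    With s random features the Hessian Z.T @ Z / n + lam*I is only
    (s, s), so forming and factoring it is cheap. For squared loss one
    Newton step is exact; the loop is kept for general smooth losses."""
    Z = rff(X, n_features)
    n, s = Z.shape
    w = np.zeros(s)
    H = Z.T @ Z / n + lam * np.eye(s)           # small, explicit Hessian
    for _ in range(n_steps):
        g = Z.T @ (Z @ w - y) / n + lam * w     # ridge-regression gradient
        w -= np.linalg.solve(H, g)              # Newton step
    return w
```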
arXiv Detail & Related papers (2020-02-12T01:14:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.