Hessian Averaging in Stochastic Newton Methods Achieves Superlinear
Convergence
- URL: http://arxiv.org/abs/2204.09266v1
- Date: Wed, 20 Apr 2022 07:14:21 GMT
- Title: Hessian Averaging in Stochastic Newton Methods Achieves Superlinear
Convergence
- Authors: Sen Na, Micha{\l} Derezi\'nski, Michael W. Mahoney
- Abstract summary: We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method.
We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage.
- Score: 69.65563161962245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider minimizing a smooth and strongly convex objective function using
a stochastic Newton method. At each iteration, the algorithm is given an oracle
access to a stochastic estimate of the Hessian matrix. The oracle model
includes popular algorithms such as the Subsampled Newton and Newton Sketch,
which can efficiently construct stochastic Hessian estimates for many tasks.
Despite using second-order information, these existing methods do not exhibit
superlinear convergence, unless the stochastic noise is gradually reduced to
zero during the iteration, which would lead to a computational blow-up in the
per-iteration cost. We address this limitation with Hessian averaging: instead
of using the most recent Hessian estimate, our algorithm maintains an average
of all past estimates. This reduces the stochastic noise while avoiding the
computational blow-up. We show that this scheme enjoys local $Q$-superlinear
convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$,
where $\Upsilon$ is proportional to the level of stochastic noise in the
Hessian oracle. A potential drawback of this (uniform averaging) approach is
that the averaged estimates contain Hessian information from the global phase
of the iteration, i.e., before the iterates converge to a local neighborhood.
This leads to a distortion that may substantially delay the superlinear
convergence until long after the local neighborhood is reached. To address this
drawback, we study a number of weighted averaging schemes that assign larger
weights to recent Hessians, so that the superlinear convergence arises sooner,
albeit with a slightly slower rate. Remarkably, we show that there exists a
universal weighted averaging scheme that transitions to local convergence at an
optimal stage, and still enjoys a superlinear convergence~rate nearly (up to a
logarithmic factor) matching that of uniform Hessian averaging.
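The core idea of Hessian averaging can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the regularized least-squares objective, the subsample size, and the plain (undamped) Newton step are all illustrative choices. Each iteration draws a Subsampled-Newton-style stochastic Hessian estimate and folds it into a uniform running average, which is then used in the Newton step.

```python
import numpy as np

# Illustrative strongly convex objective (ridge-regularized least squares):
# f(x) = (1/2n) ||A x - b||^2 + (mu/2) ||x||^2
rng = np.random.default_rng(0)
n, d, mu = 500, 5, 0.1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def gradient(x):
    return A.T @ (A @ x - b) / n + mu * x

def hessian_estimate(batch):
    # Stochastic Hessian oracle in the Subsampled Newton style:
    # the Hessian of the loss on a random row subsample, plus the regularizer.
    Ab = A[batch]
    return Ab.T @ Ab / len(batch) + mu * np.eye(d)

x = np.zeros(d)
H_avg = np.zeros((d, d))
for t in range(1, 51):
    batch = rng.choice(n, size=20, replace=False)
    # Uniform Hessian averaging: running mean of all past estimates.
    # The noise in H_avg shrinks like 1/sqrt(t) at no extra per-step cost.
    H_avg += (hessian_estimate(batch) - H_avg) / t
    # Newton step with the averaged Hessian.
    x = x - np.linalg.solve(H_avg, gradient(x))

# Exact minimizer for comparison.
x_star = np.linalg.solve(A.T @ A / n + mu * np.eye(d), A.T @ b / n)
err = np.linalg.norm(x - x_star)
print(err)
```

The weighted schemes discussed in the abstract would replace the uniform $1/t$ weight above with one that favors recent Hessian estimates, trading a slightly slower rate for an earlier transition to superlinear convergence.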
Related papers
- Stochastic Newton Proximal Extragradient Method [18.47705532817026]
We propose a novel stochastic Newton proximal extragradient method that improves existing convergence bounds.
We accomplish this by extending the Hybrid Proximal Extragradient (HPE) framework.
arXiv Detail & Related papers (2024-06-03T16:06:23Z) - Incremental Quasi-Newton Methods with Faster Superlinear Convergence
Rates [50.36933471975506]
We consider the finite-sum optimization problem, where each component function is strongly convex and has Lipschitz continuous gradient and Hessian.
The recently proposed incremental quasi-Newton method is based on BFGS update and achieves a local superlinear convergence rate.
This paper proposes a more efficient quasi-Newton method by incorporating the symmetric rank-1 update into the incremental framework.
arXiv Detail & Related papers (2024-02-04T05:54:51Z) - Fast Nonlinear Two-Time-Scale Stochastic Approximation: Achieving $O(1/k)$ Finite-Sample Complexity [2.5382095320488665]
This paper develops a new variant of the two-time-scale stochastic approximation to find the roots of two coupled nonlinear operators.
Our key idea is to leverage the classic Ruppert-Polyak averaging technique to dynamically estimate the operators through their samples.
The estimated values of these averaging steps will then be used in the two-time-scale approximation updates to find the desired solution.
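The Ruppert-Polyak averaging idea referenced above can be illustrated in isolation. This is a minimal sketch under illustrative assumptions (a one-dimensional operator with a known root, Gaussian noise, and a $k^{-0.7}$ step size), not the two-time-scale algorithm from that paper: the raw iterate follows a noisy stochastic-approximation update, while a running average of the iterates smooths out the noise.

```python
import numpy as np

rng = np.random.default_rng(1)
target = 3.0  # root of the illustrative operator F(x) = x - target

x, x_avg = 0.0, 0.0
for k in range(1, 5001):
    # Noisy evaluation of the operator at the current iterate.
    noisy_F = (x - target) + rng.normal(scale=1.0)
    x -= noisy_F / k**0.7           # stochastic approximation update
    x_avg += (x - x_avg) / k        # Ruppert-Polyak running average

final_gap = abs(x_avg - target)
print(final_gap)
```

The averaged iterate `x_avg` is typically far closer to the root than the raw iterate `x`, which is exactly the effect the paper exploits when estimating the coupled operators from samples.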
arXiv Detail & Related papers (2024-01-23T13:44:15Z) - Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}}\, F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present an \emph{accelerated gradient-extragradient} (AG-EG) descent-ascent algorithm that combines extragradient steps with acceleration.
arXiv Detail & Related papers (2022-06-17T06:10:20Z) - Accelerated SGD for Non-Strongly-Convex Least Squares [14.010916616909743]
We consider stochastic approximation for the least squares regression problem in the non-strongly convex setting.
We present the first practical algorithm that achieves the optimal prediction error rates in terms of dependence on the noise of the problem.
arXiv Detail & Related papers (2022-03-03T14:39:33Z) - Nesterov Accelerated Shuffling Gradient Method for Convex Optimization [15.908060383231371]
We show that our algorithm has an improved rate of $\mathcal{O}(1/T)$ using unified shuffling schemes.
Our convergence analysis does not require an assumption on bounded domain or a bounded gradient condition.
Numerical simulations demonstrate the efficiency of our algorithm.
arXiv Detail & Related papers (2022-02-07T21:23:17Z) - Nys-Curve: Nystr\"om-Approximated Curvature for Stochastic Optimization [20.189732632410024]
The quasi-Newton method provides curvature information by approximating the Hessian using the secant equation.
We propose an approximate Newton step-based DP optimization algorithm for large-scale empirical risk of convex functions with linear convergence rates.
arXiv Detail & Related papers (2021-10-16T14:04:51Z) - On Stochastic Moving-Average Estimators for Non-Convex Optimization [105.22760323075008]
In this paper, we demonstrate the power of a widely used stochastic moving-average estimator (SEMA).
Building on SEMA, we improve state-of-the-art results for a range of non-convex optimization problems.
arXiv Detail & Related papers (2021-04-30T08:50:24Z) - A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm
for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- Momentum-assisted Single-Timescale Stochastic Approximation (MSTSA) -- for tackling bilevel optimization problems.
MSTSA allows us to control the error across iterations caused by inexact solutions to the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.