Hessian Averaging in Stochastic Newton Methods Achieves Superlinear
Convergence
- URL: http://arxiv.org/abs/2204.09266v1
- Date: Wed, 20 Apr 2022 07:14:21 GMT
- Title: Hessian Averaging in Stochastic Newton Methods Achieves Superlinear
Convergence
- Authors: Sen Na, Micha{\l} Derezi\'nski, Michael W. Mahoney
- Abstract summary: We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method.
We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage.
- Score: 69.65563161962245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider minimizing a smooth and strongly convex objective function using
a stochastic Newton method. At each iteration, the algorithm is given an oracle
access to a stochastic estimate of the Hessian matrix. The oracle model
includes popular algorithms such as the Subsampled Newton and Newton Sketch,
which can efficiently construct stochastic Hessian estimates for many tasks.
Despite using second-order information, these existing methods do not exhibit
superlinear convergence, unless the stochastic noise is gradually reduced to
zero during the iteration, which would lead to a computational blow-up in the
per-iteration cost. We address this limitation with Hessian averaging: instead
of using the most recent Hessian estimate, our algorithm maintains an average
of all past estimates. This reduces the stochastic noise while avoiding the
computational blow-up. We show that this scheme enjoys local $Q$-superlinear
convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$,
where $\Upsilon$ is proportional to the level of stochastic noise in the
Hessian oracle. A potential drawback of this (uniform averaging) approach is
that the averaged estimates contain Hessian information from the global phase
of the iteration, i.e., before the iterates converge to a local neighborhood.
This leads to a distortion that may substantially delay the superlinear
convergence until long after the local neighborhood is reached. To address this
drawback, we study a number of weighted averaging schemes that assign larger
weights to recent Hessians, so that the superlinear convergence arises sooner,
albeit with a slightly slower rate. Remarkably, we show that there exists a
universal weighted averaging scheme that transitions to local convergence at an
optimal stage, and still enjoys a superlinear convergence~rate nearly (up to a
logarithmic factor) matching that of uniform Hessian averaging.
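As an illustration of the averaging idea (a minimal sketch, not the authors' implementation), the Python snippet below runs a stochastic Newton iteration on a toy strongly convex quadratic, replacing the latest noisy Hessian estimate with a running weighted average. The noisy oracle noisy_hess, the helper averaged_newton, and the quadratically growing recency weights weight_fn=lambda t: t**2 are illustrative assumptions; in particular, these weights are not the paper's universal weighting scheme, only a stand-in that emphasizes recent Hessians.
```python
# Minimal sketch of Hessian averaging inside a stochastic Newton iteration.
# Illustrative only: the noisy oracle stands in for subsampling/sketching,
# and the t**2 recency weights are NOT the paper's universal scheme.
import numpy as np

def averaged_newton(grad, hess_oracle, x0, steps=50, weight_fn=None):
    """Newton steps using a running (weighted) average of Hessian estimates.

    weight_fn(t) is the weight of the t-th estimate; None gives uniform
    averaging, while weights that grow with t emphasize recent Hessians.
    """
    x = x0.copy()
    H_avg = np.zeros((x0.size, x0.size))
    total_w = 0.0
    for t in range(1, steps + 1):
        w = 1.0 if weight_fn is None else float(weight_fn(t))
        # Running weighted average: H_avg = (sum_i w_i H_i) / (sum_i w_i).
        H_avg = (total_w * H_avg + w * hess_oracle(x)) / (total_w + w)
        total_w += w
        # Newton step with the averaged Hessian; averaging damps the noise.
        x = x - np.linalg.solve(H_avg, grad(x))
    return x

# Toy strongly convex quadratic f(x) = 0.5 x^T A x - b^T x.
rng = np.random.default_rng(0)
d = 5
A = np.diag(np.linspace(1.0, 10.0, d))
b = rng.standard_normal(d)
grad = lambda x: A @ x - b

def noisy_hess(x):
    # Symmetric Gaussian perturbation of the true Hessian; kept small so the
    # averaged estimates stay positive definite in this toy example.
    E = rng.standard_normal((d, d))
    return A + 0.1 * (E + E.T) / 2

x_star = np.linalg.solve(A, b)
x_unif = averaged_newton(grad, noisy_hess, np.zeros(d))                            # uniform averaging
x_wtd = averaged_newton(grad, noisy_hess, np.zeros(d), weight_fn=lambda t: t**2)   # recency-weighted
print("uniform averaging error: ", np.linalg.norm(x_unif - x_star))
print("weighted averaging error:", np.linalg.norm(x_wtd - x_star))
```
On this toy problem both variants converge because the averaged Hessian concentrates around the true curvature as $t$ grows; the paper's analysis quantifies how the choice of weights trades off when the superlinear phase begins against how closely the rate $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$ is matched.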
Related papers
- An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness [15.656614304616006]
This paper investigates a class of bilevel optimization problems where the upper-level function is nonconvex with potentially unbounded smoothness and the lower-level problem is strongly convex.
These problems have significant applications in machine learning, such as text classification using neural networks.
arXiv Detail & Related papers (2024-09-28T02:30:44Z) - A Sample Efficient Alternating Minimization-based Algorithm For Robust Phase Retrieval [56.67706781191521]
In this work, we study a robust phase retrieval problem in which the task is to recover an unknown signal.
Our proposed approach avoids the need for computationally expensive spectral initialization, instead using simple gradient steps while remaining robust to outliers.
arXiv Detail & Related papers (2024-09-07T06:37:23Z) - Fast Unconstrained Optimization via Hessian Averaging and Adaptive Gradient Sampling Methods [0.3222802562733786]
We consider minimizing finite-sum expectation objective functions via Hessian-averaging based subsampled Newton methods.
These methods allow for inexactness and have fixed per-iteration Hessian approximation costs.
We present novel analysis techniques and address challenges in their practical implementation.
arXiv Detail & Related papers (2024-08-14T03:27:48Z) - Stochastic Newton Proximal Extragradient Method [18.47705532817026]
We propose a novel stochastic Newton proximal extragradient method that improves existing complexity bounds.
We accomplish this by extending the Hybrid Proximal Extragradient (HPE) framework.
arXiv Detail & Related papers (2024-06-03T16:06:23Z) - Incremental Quasi-Newton Methods with Faster Superlinear Convergence
Rates [50.36933471975506]
We consider the finite-sum optimization problem, where each component function is strongly convex and has Lipschitz continuous gradient and Hessian.
The recently proposed incremental quasi-Newton method is based on the BFGS update and achieves a local superlinear convergence rate.
This paper proposes a more efficient quasi-Newton method by incorporating the symmetric rank-1 update into the incremental framework.
arXiv Detail & Related papers (2024-02-04T05:54:51Z) - Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}}\, F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$.
We present an accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragradient steps with Nesterov acceleration.
arXiv Detail & Related papers (2022-06-17T06:10:20Z) - Nesterov Accelerated Shuffling Gradient Method for Convex Optimization [15.908060383231371]
We show that our algorithm has an improved rate of $\mathcal{O}(1/T)$ using unified shuffling schemes.
Our convergence analysis does not require a bounded domain assumption or a bounded gradient condition.
Numerical simulations demonstrate the efficiency of our algorithm.
arXiv Detail & Related papers (2022-02-07T21:23:17Z) - Nys-Curve: Nystr\"om-Approximated Curvature for Stochastic Optimization [20.189732632410024]
The quasi-Newton method provides curvature information by approximating the Hessian using the secant equation.
We propose an approximate Newton-step-based DP optimization algorithm for large-scale empirical risk minimization of convex functions with linear convergence rates.
arXiv Detail & Related papers (2021-10-16T14:04:51Z) - A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm
for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum-assisted Single-timescale Stochastic Approximation (MSTSA) algorithm -- for tackling stochastic bilevel optimization problems.
MSTSA allows us to control the error in its iterations caused by inexact solutions of the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point per step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.