Nys-Curve: Nyström-Approximated Curvature for Stochastic Optimization
- URL: http://arxiv.org/abs/2110.08577v1
- Date: Sat, 16 Oct 2021 14:04:51 GMT
- Title: Nys-Curve: Nyström-Approximated Curvature for Stochastic Optimization
- Authors: Hardik Tankaria, Dinesh Singh, Makoto Yamada
- Abstract summary: The quasi-Newton method provides curvature information by approximating the Hessian using the secant equation.
We propose an approximate Newton step-based stochastic optimization algorithm for large-scale empirical risk minimization of convex functions with linear convergence rates.
- Score: 20.189732632410024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quasi-Newton methods generally provide curvature information by
approximating the Hessian using the secant equation. However, the secant
equation becomes imprecise in approximating the Newton step because it relies only on
the first-order derivatives. In this study, we propose an approximate Newton
step-based stochastic optimization algorithm for large-scale empirical risk
minimization of convex functions with linear convergence rates. Specifically,
we compute a partial column Hessian of size ($d\times k$) with $k\ll d$
randomly selected variables, then use the Nyström method to better
approximate the full Hessian matrix. To further reduce the computational
complexity per iteration, we directly compute the update step
($\Delta\boldsymbol{w}$) without computing and storing the full Hessian or its
inverse. Furthermore, to address large-scale scenarios in which even computing
a partial Hessian may require significant time, we use distribution-preserving
(DP) sub-sampling to compute a partial Hessian. The DP sub-sampling generates
$p$ sub-samples with similar first and second-order distribution statistics and
selects a single sub-sample at each epoch in a round-robin manner to compute
the partial Hessian. We integrate our approximated Hessian with stochastic
gradient descent and stochastic variance-reduced gradients to solve the
logistic regression problem. The numerical experiments show that the proposed
approach obtains a better approximation of Newton's method and performs
competitively with state-of-the-art first-order and stochastic quasi-Newton
methods.
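As a concrete illustration of the update described above, the following minimal sketch (in NumPy, not the authors' reference implementation) builds the partial column Hessian $C = H[:,\mathrm{idx}]$ of size $d\times k$ for regularized logistic regression, forms the Nyström approximation $H \approx C W^{+} C^\top$ with $W = H[\mathrm{idx},\mathrm{idx}]$, and computes a damped step $\Delta\boldsymbol{w} = -(\rho I + C W^{+} C^\top)^{-1}\nabla f(\boldsymbol{w})$ through the Woodbury identity, so only a $k\times k$ system is solved and the full $d\times d$ Hessian is never formed or stored. The damping $\rho$, the L2 penalty $\lambda$, and the plain random splits standing in for distribution-preserving (DP) sub-samples are assumptions of this sketch, not values or procedures taken from the paper.
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nystrom_newton_step(X, y, w, idx, rho=1e-3, lam=1e-4):
    """One damped approximate Newton step from a Nystrom-approximated Hessian.

    Illustrative sketch only: `rho` (damping) and `lam` (L2 penalty) are
    assumed hyper-parameters, not values taken from the paper.
    """
    n, _ = X.shape
    p = sigmoid(X @ w)                         # predicted probabilities
    g = X.T @ (p - y) / n + lam * w            # full gradient of the regularized loss

    # Partial column Hessian C = H[:, idx] of size d x k,
    # where H = (1/n) X^T diag(p(1-p)) X is the logistic-loss Hessian.
    s = p * (1.0 - p)
    C = (X.T * s) @ X[:, idx] / n              # d x k
    W = C[idx, :]                              # k x k intersection block

    # Nystrom approximation H ~= C W^+ C^T. Solve (rho I + C W^+ C^T) dw = -g
    # via Woodbury: inverse = (1/rho) [I - C (rho W + C^T C)^{-1} C^T],
    # so only a k x k system is solved; the d x d Hessian is never formed.
    k = len(idx)
    M = rho * W + C.T @ C + 1e-10 * np.eye(k)  # jitter keeps M invertible
    return -(g - C @ np.linalg.solve(M, C.T @ g)) / rho

# Usage sketch: round-robin over sub-samples and fresh random columns per epoch.
# Plain random splits stand in here for the paper's distribution-preserving
# (DP) sub-sampling.
rng = np.random.default_rng(0)
n, d, k = 2000, 100, 10
X = rng.standard_normal((n, d))
y = (rng.random(n) < sigmoid(X @ rng.standard_normal(d))).astype(float)
subsamples = np.array_split(rng.permutation(n), 4)      # p = 4 sub-samples
w = np.zeros(d)
for epoch in range(20):
    rows = subsamples[epoch % len(subsamples)]          # round-robin selection
    idx = rng.choice(d, size=k, replace=False)          # k << d random columns
    w = w + nystrom_newton_step(X[rows], y[rows], w, idx)
```
The Woodbury step uses $(\rho I + C W^{-1} C^\top)^{-1} g = \frac{1}{\rho}\big[g - C(\rho W + C^\top C)^{-1} C^\top g\big]$, which holds when $W$ is invertible; the jitter term in the code keeps the $k\times k$ system well conditioned.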
Related papers
- Fast Unconstrained Optimization via Hessian Averaging and Adaptive Gradient Sampling Methods [0.3222802562733786]
We consider minimizing finite-sum expectation objective functions via Hessian-averaging based subsampled Newton methods.
These methods allow for inexactness and have fixed per-iteration Hessian approximation costs.
We present novel analysis techniques and discuss challenges in their practical implementation.
arXiv Detail & Related papers (2024-08-14T03:27:48Z) - Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient [40.22217106270146]
Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods.
In composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random gradient estimation.
This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances.
arXiv Detail & Related papers (2024-05-28T02:27:53Z) - Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming [53.63469275932989]
We consider online statistical inference of constrained nonlinear optimization problems.
We apply the Stochastic Sequential Quadratic Programming (StoSQP) method to solve these problems.
arXiv Detail & Related papers (2022-05-27T00:34:03Z) - Hessian Averaging in Stochastic Newton Methods Achieves Superlinear
Convergence [69.65563161962245]
We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method with Hessian averaging (a minimal sketch of this averaging idea appears after this list).
We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage.
arXiv Detail & Related papers (2022-04-20T07:14:21Z) - Convergence of First-Order Methods for Constrained Nonconvex
Optimization with Dependent Data [7.513100214864646]
We show a worst-case convergence rate of $\tilde{O}(t^{-1/4})$ and a complexity of $\tilde{O}(\varepsilon^{-4})$ in terms of the Moreau envelope for smooth nonconvex optimization.
We obtain first online nonnegative matrix factorization algorithms for dependent data based on projected gradient methods with adaptive step sizes and optimal convergence.
arXiv Detail & Related papers (2022-03-29T17:59:10Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - Newton-LESS: Sparsification without Trade-offs for the Sketched Newton
Update [88.73437209862891]
In second-order optimization, a potential bottleneck can be computing the Hessian matrix of the optimized function at every iteration.
We show that the Gaussian sketching matrix can be drastically sparsified, significantly reducing the computational cost of sketching.
We prove that Newton-LESS enjoys nearly the same problem-independent local convergence rate as Gaussian embeddings.
arXiv Detail & Related papers (2021-07-15T17:33:05Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}}+\frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z) - Stochastic Subspace Cubic Newton Method [14.624340432672172]
We propose a new randomized second-order optimization algorithm for minimizing a high dimensional convex function $f$.
We prove that as we vary the minibatch size, the global convergence rate of SSCN interpolates between the rate of coordinate descent (CD) and the rate of cubic regularized Newton.
Remarkably, the local convergence rate of SSCN matches the rate of subspace descent applied to the problem of minimizing the quadratic function $\frac{1}{2}(x-x^*)^\top \nabla^2 f(x^*)(x-x^*)$, where $x^*$ is the minimizer of $f$.
arXiv Detail & Related papers (2020-02-21T19:42:18Z)
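As a brief complement to the Hessian-averaging entries above (see the forward reference there), here is a minimal sketch of a damped stochastic Newton iteration that keeps a running average of noisy Hessian estimates. The callables `grad` and `hess_sample`, the uniform (rather than general weighted) averaging, and the damping `rho` are assumptions of this sketch rather than details from the cited papers.
```python
import numpy as np

def averaged_stochastic_newton(grad, hess_sample, w0, steps=50, rho=1e-3):
    """Damped stochastic Newton iteration with a running (uniform) Hessian average.

    `grad(w)` returns a gradient estimate and `hess_sample(w)` a noisy Hessian
    estimate; both are assumed user-supplied callables. Uniform averaging is a
    simplification of the weighted schemes analyzed in the cited work.
    """
    w = w0.copy()
    H_bar = np.zeros((w.size, w.size))
    for t in range(1, steps + 1):
        H_t = hess_sample(w)                 # noisy Hessian estimate at w
        H_bar += (H_t - H_bar) / t           # running average of all estimates
        g = grad(w)                          # gradient (exact or mini-batch)
        w = w - np.linalg.solve(H_bar + rho * np.eye(w.size), g)
    return w
```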