Correcting Momentum with Second-order Information
- URL: http://arxiv.org/abs/2103.03265v1
- Date: Thu, 4 Mar 2021 19:01:20 GMT
- Title: Correcting Momentum with Second-order Information
- Authors: Hoang Tran, Ashok Cutkosky
- Abstract summary: We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
- Score: 50.992629498861724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a new algorithm for non-convex stochastic optimization that finds
an $\epsilon$-critical point in the optimal $O(\epsilon^{-3})$ stochastic
gradient and Hessian-vector product computations. Our algorithm uses
Hessian-vector products to "correct" a bias term in the momentum of SGD with
momentum. This leads to better gradient estimates in a manner analogous to
variance reduction methods. In contrast to prior work, we do not require
excessively large batch sizes (or indeed any restrictions at all on the batch
size), and both our algorithm and its analysis are much simpler. We validate
our results on a variety of large-scale deep learning benchmarks and
architectures, where we see improvements over SGD and Adam.
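A rough sketch of the idea described in the abstract: the momentum buffer is "transported" from the previous iterate to the current one with a Hessian-vector product, which is the bias correction the abstract refers to. This is a minimal illustration, not the authors' algorithm or released code; the loss, data shapes, hyperparameters, and the exact placement of the correction term are assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of momentum corrected
# with a Hessian-vector product, in the spirit of the abstract above.
import jax
import jax.numpy as jnp

def loss(params, batch):
    # Hypothetical smooth non-convex loss; any differentiable model fits here.
    x, y = batch
    return jnp.mean((jnp.tanh(x @ params) - y) ** 2)

grad_fn = jax.grad(loss)

def hvp(params, batch, v):
    # Hessian-vector product via forward-over-reverse differentiation;
    # roughly the cost of one extra gradient evaluation.
    return jax.jvp(lambda p: grad_fn(p, batch), (params,), (v,))[1]

def corrected_momentum_step(params, prev_params, momentum, batch,
                            lr=0.01, beta=0.9):
    g = grad_fn(params, batch)
    # Transport the stale momentum from the previous iterate to the current
    # one: H(params) @ (params - prev_params) estimates how the old gradient
    # information has drifted, i.e. the bias being corrected.
    correction = hvp(params, batch, params - prev_params)
    momentum = beta * (momentum + correction) + (1.0 - beta) * g
    # Plain (unnormalized) step for simplicity; the paper's update may differ.
    new_params = params - lr * momentum
    return new_params, params, momentum
```

The correction term plays a role similar to the control variate in variance-reduction estimators, but it reuses second-order information at the current batch rather than evaluating a second gradient at the previous iterate, which fits the abstract's claim that no large or restricted batch sizes are required.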
Related papers
- Improved Rate of First Order Algorithms for Entropic Optimal Transport [2.1485350418225244]
This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport.
We propose an accelerated primal-dual mirror descent algorithm with variance reduction.
Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{O}(n^2/\epsilon)$ for solving OT.
arXiv Detail & Related papers (2023-01-23T19:13:25Z) - Formal guarantees for heuristic optimization algorithms used in machine
learning [6.978625807687497]
Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for solving large-scale optimization problems in machine learning (ML).
We provide formal guarantees for a few convex optimization methods and propose improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate the empirical advantages of our methods over standard gradient methods on two popular deep learning tasks.
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we study the gradient-free (zeroth-order) minimax optimization problem in the nonconvex-strongly-concave setting.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Robust estimation via generalized quasi-gradients [28.292300073453877]
We show why many recently proposed robust estimation problems are efficiently solvable.
We identify the existence of "generalized quasi-gradients", show that they exist for these problems, and construct efficient algorithms.
arXiv Detail & Related papers (2020-05-28T15:14:33Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
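For context on the variance-reduction analogy drawn in the main abstract, and on the "mixture of large and small batch gradients" mentioned in the last related paper above, here is a textbook SVRG sketch. It is a generic illustration under assumed interfaces, not that paper's random-top-k variant and not the main paper's algorithm.

```python
# Textbook SVRG sketch: a full (large-batch) gradient at a snapshot point is
# combined with cheap per-example gradients via a control variate.
import jax
import jax.numpy as jnp

def svrg(loss_i, x0, n, lr=0.1, epochs=5, inner_steps=100, seed=0):
    """loss_i(x, i) is the i-th component loss; the objective is its mean over i."""
    grad_i = jax.grad(loss_i)  # gradient with respect to x
    key = jax.random.PRNGKey(seed)
    x = x0
    for _ in range(epochs):
        snapshot = x
        # Large-batch (here: full) gradient at the snapshot point.
        full_grad = jnp.mean(
            jnp.stack([grad_i(snapshot, i) for i in range(n)]), axis=0)
        for _ in range(inner_steps):
            key, sub = jax.random.split(key)
            i = int(jax.random.randint(sub, (), 0, n))
            # Small-batch gradient, debiased by the control variate
            # grad_i(snapshot, i) - full_grad: same expectation, lower variance.
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x = x - lr * v
    return x
```

Loosely, the main paper's momentum correction serves a similar debiasing role, but (per its abstract) it uses Hessian-vector products rather than re-evaluated snapshot gradients and does not require large batch sizes.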
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.