A Variance Controlled Stochastic Method with Biased Estimation for
Faster Non-convex Optimization
- URL: http://arxiv.org/abs/2102.09893v1
- Date: Fri, 19 Feb 2021 12:22:56 GMT
- Title: A Variance Controlled Stochastic Method with Biased Estimation for
Faster Non-convex Optimization
- Authors: Jia Bi, Steve R. Gunn
- Abstract summary: We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of the stochastic variance reduced gradient (SVRG) algorithm.
A hyper-parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG.
The algorithm converges within $\mathcal{O}(\min\{1/\epsilon^{3/2},n^{1/4}/\epsilon\})$ stochastic gradient evaluations, which improves the leading gradient complexity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new technique, {\em variance controlled
stochastic gradient} (VCSG), to improve the performance of the stochastic
variance reduced gradient (SVRG) algorithm. To avoid over-reducing the variance
of gradient by SVRG, a hyper-parameter $\lambda$ is introduced in VCSG that is
able to control the reduced variance of SVRG. Theory shows that the
optimization method can converge by using an unbiased gradient estimator, but
in practice, biased gradient estimation can allow more efficient convergence to
the vicinity since an unbiased approach is computationally more expensive.
$\lambda$ also has the effect of balancing the trade-off between unbiased and
biased estimations. Secondly, to minimize the number of full gradient
calculations in SVRG, a variance-bounded batch is introduced to reduce the
number of gradient calculations required in each iteration. For smooth
non-convex functions, the proposed algorithm converges to an approximate
first-order stationary point (i.e.
$\mathbb{E}\|\nabla{f}(x)\|^{2}\leq\epsilon$) within
$\mathcal{O}(\min\{1/\epsilon^{3/2},n^{1/4}/\epsilon\})$ stochastic
gradient evaluations, which improves on the leading gradient complexity of the
stochastic gradient-based method SCSG
($\mathcal{O}(\min\{1/\epsilon^{5/3},n^{2/3}/\epsilon\})$). It is shown
theoretically and experimentally that VCSG can be deployed to improve
convergence.
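To make the role of $\lambda$ and of the variance-bounded batch concrete, here is a minimal Python sketch of a $\lambda$-controlled, SVRG-style loop. The specific estimator $v=\lambda(\nabla f_i(x)-\nabla f_i(\tilde{x}))+\tilde{g}$, the mini-batch snapshot gradient, and the names `grad_i`, `grad_batch`, `lam`, and `b` are illustrative assumptions for this sketch, not the authors' exact VCSG update.

```python
import numpy as np

def vcsg_sketch(grad_i, grad_batch, x0, n, outer=10, inner=50, b=64,
                lam=0.5, lr=0.1, seed=0):
    """Illustrative lambda-controlled SVRG-style loop (not the exact VCSG update).

    grad_i(x, i)     : gradient of the i-th component function at x
    grad_batch(x, B) : mini-batch gradient of f over the index set B at x
    lam              : lambda in [0, 1]; lam = 1 gives an unbiased SVRG-style
                       estimator, lam < 1 trades bias for a smaller correction
    b                : size of the variance-bounded snapshot batch
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(outer):                  # outer loop: refresh the snapshot
        x_snap = x.copy()
        batch = rng.choice(n, size=min(b, n), replace=False)
        g_snap = grad_batch(x_snap, batch)  # batch snapshot, not a full gradient
        for _ in range(inner):              # inner loop: cheap stochastic steps
            i = int(rng.integers(n))
            # lam = 1: classic SVRG correction (unbiased);
            # lam < 1: biased estimator with a reduced correction term
            v = lam * (grad_i(x, i) - grad_i(x_snap, i)) + g_snap
            x -= lr * v
    return x
```

With `lam = 1` and `b = n` this reduces to standard SVRG; the abstract's claim concerns how the choice of $\lambda$ and the batch size trades bias, variance, and the total number of gradient evaluations.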
Related papers
- Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient [40.22217106270146]
Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods.
In composite optimization problems, ZO methods encounter an additional variance called the coordinate-wise variance, which stems from the random gradient estimation.
This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances.
arXiv Detail & Related papers (2024-05-28T02:27:53Z) - Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points with high probability, with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Stochastic Bias-Reduced Gradient Methods [44.35885731095432]
We develop a new primitive for stochastic optimization: a low-bias, low-cost estimator of the minimizer $x_\star$ obtained via Moreau-Yosida smoothing of the objective.
arXiv Detail & Related papers (2021-06-17T13:33:05Z) - Improving the Transient Times for Distributed Stochastic Gradient
Methods [5.215491794707911]
We study a distributed stochastic gradient algorithm, called exact diffusion with adaptive stepsizes (EDAS).
We show that EDAS achieves the same network-independent convergence rate as centralized stochastic gradient descent (SGD).
To the best of our knowledge, EDAS achieves the shortest transient time when the average of the $n$ cost functions is strongly convex.
arXiv Detail & Related papers (2021-05-11T08:09:31Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration complexity of zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of iteration complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator (a rough sketch is given after this list).
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z) - Stochastic gradient-free descents [8.663453034925363]
We propose gradient-free methods and accelerated gradients with momentum for solving optimization problems.
We analyze the convergence behavior of these methods under the mean-variance framework.
arXiv Detail & Related papers (2019-12-31T13:56:36Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that, when training GANs, the advantage of adaptive gradient algorithms over non-adaptive ones can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
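As referenced in the "Variance Reduction with Sparse Gradients" entry above, a random-top-k style sparsifier can be sketched as follows. The particular split between deterministically kept and randomly sampled coordinates (`k1` and `k2` here) is an assumption for illustration, not necessarily the exact operator defined in that paper.

```python
import numpy as np

def random_top_k(g, k1, k2, seed=0):
    """Rough sketch of a random-top-k sparsifier (illustrative assumption):
    keep the k1 largest-magnitude coordinates of g plus k2 coordinates drawn
    uniformly from the remainder; zero out everything else."""
    rng = np.random.default_rng(seed)
    g = np.asarray(g, dtype=float)
    out = np.zeros_like(g)
    top = np.argsort(np.abs(g))[-k1:]        # deterministic top-k1 coordinates
    rest = np.setdiff1d(np.arange(g.size), top)
    rand = rng.choice(rest, size=min(k2, rest.size), replace=False)
    keep = np.concatenate([top, rand])
    out[keep] = g[keep]                      # sparse gradient estimate
    return out
```

A practical variant might rescale the randomly sampled coordinates so the sparsified gradient stays closer to unbiased.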