Variance-reduced Clipping for Non-convex Optimization
- URL: http://arxiv.org/abs/2303.00883v2
- Date: Fri, 2 Jun 2023 23:35:16 GMT
- Title: Variance-reduced Clipping for Non-convex Optimization
- Authors: Amirhossein Reisizadeh, Haochuan Li, Subhro Das, Ali Jadbabaie
- Abstract summary: Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experimental studies show that, when training with clipping, the smoothness of the objective grows with the gradient norm; under the resulting relaxed smoothness assumption, a variance reduction technique (SPIDER) with a carefully designed learning rate improves the order of the stochastic gradient complexity.
- Score: 24.765794811146144
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Gradient clipping is a standard training technique used in deep learning
applications such as large-scale language modeling to mitigate exploding
gradients. Recent experimental studies have demonstrated a fairly special
behavior in the smoothness of the training objective along its trajectory when
trained with gradient clipping. That is, the smoothness grows with the gradient
norm. This is in clear contrast to the well-established assumption in folklore
non-convex optimization, a.k.a. $L$--smoothness, where the smoothness is
assumed to be bounded by a constant $L$ globally. The recently introduced
$(L_0,L_1)$--smoothness is a more relaxed notion that captures such behavior in
non-convex optimization. In particular, it has been shown that under this
relaxed smoothness assumption, SGD with clipping requires $O(\epsilon^{-4})$
stochastic gradient computations to find an $\epsilon$--stationary solution. In
this paper, we employ a variance reduction technique, namely SPIDER, and
demonstrate that for a carefully designed learning rate, this complexity is
improved to $O(\epsilon^{-3})$ which is order-optimal. Our designed learning
rate comprises the clipping technique to mitigate the growing smoothness.
Moreover, when the objective function is the average of $n$ components, we
improve the existing $O(n\epsilon^{-2})$ bound on the stochastic gradient
complexity to $O(\sqrt{n} \epsilon^{-2} + n)$, which is order-optimal as well.
In addition to being theoretically optimal, SPIDER with our designed parameters
demonstrates comparable empirical performance against variance-reduced methods
such as SVRG and SARAH in several vision tasks.
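As a rough illustration of the two ingredients described in the abstract (the SPIDER variance-reduced gradient estimator and a learning rate clipped by the norm of that estimator), the following NumPy sketch runs such a scheme on a toy finite-sum problem. This is a minimal sketch, not the authors' implementation: the objective, the batch size, the refresh period q, and the constants eta0 and c are illustrative assumptions rather than the parameter choices analyzed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum objective f(x) = (1/n) sum_i f_i(x) with
# f_i(x) = 0.25 * (a_i . x - b_i)^4; its local smoothness grows with the
# gradient norm, an illustrative stand-in for an (L0, L1)-smooth loss.
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_batch(x, idx):
    """Average gradient of f_i over the mini-batch given by `idx`."""
    r = A[idx] @ x - b[idx]
    return (A[idx] * (r ** 3)[:, None]).mean(axis=0)

def spider_clipped(T=2000, q=32, batch=32, eta0=0.1, c=0.5):
    """SPIDER estimator with the clipped step eta_t = min(eta0, c / ||v_t||).

    T, q, batch, eta0 and c are hypothetical choices; the paper derives
    specific settings of these parameters to obtain its O(eps^-3) and
    O(sqrt(n) eps^-2 + n) complexity bounds.
    """
    x = rng.standard_normal(d)
    v = grad_batch(x, np.arange(n))            # full gradient at the anchor point
    for t in range(T):
        eta = min(eta0, c / (np.linalg.norm(v) + 1e-12))   # clipped learning rate
        x_new = x - eta * v
        if (t + 1) % q == 0:
            v = grad_batch(x_new, np.arange(n))            # periodic full refresh
        else:
            idx = rng.integers(0, n, size=batch)
            # recursive SPIDER update: v <- grad(x_new) - grad(x) + v on the same batch
            v = grad_batch(x_new, idx) - grad_batch(x, idx) + v
        x = x_new
    return x, np.linalg.norm(grad_batch(x, np.arange(n)))

x_hat, grad_norm = spider_clipped()
print(f"full-gradient norm at the last iterate: {grad_norm:.3e}")

In the general stochastic setting the periodic refresh would use a large mini-batch rather than the full gradient; the full-gradient variant above corresponds to the finite-sum regime behind the $O(\sqrt{n}\epsilon^{-2} + n)$ bound.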
Related papers
- Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes.
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Differential Private Stochastic Optimization with Heavy-tailed Data: Towards Optimal Rates [15.27596975662702]
We explore algorithms achieving optimal rates of DP optimization with heavy-tailed gradients.
Our results match the minimax lower bound of Kamath et al. (2022), indicating that the theoretical limit of convex optimization under DP is achievable.
arXiv Detail & Related papers (2024-08-19T11:07:05Z) - Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach [57.92727189589498]
We propose an online convex optimization approach with two different levels of adaptivity.
We obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions, respectively.
arXiv Detail & Related papers (2023-07-17T09:55:35Z) - On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffle SGD are popular because they minimize the number of cache misses and have good cache locality.
This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering.
arXiv Detail & Related papers (2023-05-30T17:47:27Z) - Optimal Stochastic Non-smooth Non-convex Optimization through
Online-to-Non-convex Conversion [56.92236659731376]
We present new algorithms for optimizing non-convex, non-smooth objectives based on a novel analysis technique.
For deterministic second-order smooth objectives, applying advanced optimistic online learning techniques enables a new $O(\epsilon^{-1.5}\delta^{-0.5})$ complexity and recovers all optimal or best-known results.
arXiv Detail & Related papers (2023-02-07T22:09:20Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms to solve non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Stochastic Bias-Reduced Gradient Methods [44.35885731095432]
We develop a new primitive for stochastic optimization: a low-bias, low-cost estimator of the minimizer $x_\star$ based on the Moreau-Yosida envelope.
arXiv Detail & Related papers (2021-06-17T13:33:05Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for non-convex non-concave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)