Nonlinear gradient mappings and stochastic optimization: A general
framework with applications to heavy-tail noise
- URL: http://arxiv.org/abs/2204.02593v1
- Date: Wed, 6 Apr 2022 06:05:52 GMT
- Title: Nonlinear gradient mappings and stochastic optimization: A general
framework with applications to heavy-tail noise
- Authors: Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu, Soummya Kar,
Nemanja Milosevic, Dusan Stamenkovic
- Abstract summary: We introduce a general framework for nonlinear gradient descent scenarios when gradient noise exhibits heavy tails.
We show that for a nonlinearity with bounded outputs and for gradient noise that may not have finite moments of order greater than one, the nonlinear SGD converges to zero at rate $O(1/t^{\zeta})$, $\zeta \in (0,1)$.
Experiments show that, while our framework is more general than existing studies of SGD under heavy-tail noise, several easy-to-implement nonlinearities from our framework are competitive with state-of-the-art alternatives on real data sets.
- Score: 11.768495184175052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a general framework for nonlinear stochastic gradient descent
(SGD) for the scenarios when gradient noise exhibits heavy tails. The proposed
framework subsumes several popular nonlinearity choices, like clipped,
normalized, signed or quantized gradient, but we also consider novel
nonlinearity choices. We establish for the considered class of methods strong
convergence guarantees assuming a strongly convex cost function with Lipschitz
continuous gradients under very general assumptions on the gradient noise. Most
notably, we show that, for a nonlinearity with bounded outputs and for the
gradient noise that may not have finite moments of order greater than one, the
nonlinear SGD's mean squared error (MSE), or equivalently, the expected cost
function's optimality gap, converges to zero at rate~$O(1/t^\zeta)$, $\zeta \in
(0,1)$. In contrast, for the same noise setting, the linear SGD generates a
sequence with unbounded variances. Furthermore, for the nonlinearities that can
be decoupled component-wise, such as the sign gradient or component-wise
clipping, we show that the nonlinear SGD asymptotically (locally) achieves an
$O(1/t)$ rate in the weak convergence sense and explicitly quantify the
corresponding asymptotic variance. Experiments show that, while our framework
is more general than existing studies of SGD under heavy-tail noise, several
easy-to-implement nonlinearities from our framework are competitive with
state-of-the-art alternatives on real data sets with heavy-tailed noise.
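The framework above reduces, in each iteration, to passing the stochastic gradient through a bounded (possibly component-wise) nonlinearity before the update. The following is a minimal Python sketch, not the authors' code: it uses a strongly convex quadratic, Student-t noise with df=1.5 (so moments of order greater than ~1.5 do not exist), and illustrative choices of nonlinearity, step-size schedule, and constants.

import numpy as np

rng = np.random.default_rng(0)

# Strongly convex quadratic f(x) = 0.5 * ||x - x_star||^2 (illustrative choice).
d = 10
x_star = rng.standard_normal(d)

def noisy_grad(x):
    # True gradient plus heavy-tailed noise: Student-t with df=1.5 has
    # infinite variance, matching the "no finite higher moments" regime.
    return (x - x_star) + rng.standard_t(df=1.5, size=d)

# Bounded nonlinearities subsumed by the framework (illustrative forms),
# plus the identity map, i.e., plain linear SGD, for contrast.
nonlinearities = {
    "clip":      lambda g, M=1.0: np.clip(g, -M, M),         # component-wise clipping
    "sign":      np.sign,                                     # sign gradient
    "normalize": lambda g: g / (np.linalg.norm(g) + 1e-12),   # normalized gradient
    "linear":    lambda g: g,                                 # plain SGD
}

def run(nonlin, T=20000, a=1.0):
    x = np.zeros(d)
    for t in range(1, T + 1):
        step = a / t                      # diminishing step size a/t (assumed schedule)
        x = x - step * nonlin(noisy_grad(x))
    return np.sum((x - x_star) ** 2)      # squared distance to the optimum

for name, phi in nonlinearities.items():
    err = np.mean([run(phi) for _ in range(5)])
    print(f"{name:>9}: final squared error ~ {err:.3f}")

In runs of this kind, the bounded-output nonlinearities typically keep the error small even though the noise has no finite variance, while the plain linear update is occasionally thrown off by very large noise samples, mirroring the contrast drawn in the abstract between the nonlinear SGD's vanishing MSE and the unbounded variances of linear SGD.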
Related papers
- Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results [60.92029979853314]
This paper investigates normalized SGD with clipping (NSGDC) and its variance-reduction variant (NSGDC-VR).
We present significant improvements in the theoretical results for both algorithms.
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Large Deviations and Improved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry [47.653744900375855]
We study large deviations and mean-squared error (MSE) guarantees for a general framework of nonlinear stochastic gradient methods in the online setting.
We provide strong results for a broad range of step-sizes in the presence of heavy-tailed noise with a symmetric density function.
arXiv Detail & Related papers (2024-10-21T04:50:57Z) - Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees [56.80920351680438]
We study high-probability convergence in online learning, in the presence of heavy-tailed noise.
Compared to state-of-the-art results, which only consider clipping and require noise with bounded $p$-th moments, we provide guarantees for a broad class of nonlinearities.
arXiv Detail & Related papers (2024-10-17T18:25:28Z) - Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions [18.47705532817026]
We show that AdaGrad outperforms SGD by a factor of $d$ under certain conditions.
Motivated by this, we introduce assumptions on the smoothness structure of the objective and the gradient variance.
arXiv Detail & Related papers (2024-06-07T02:55:57Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider gradient clipping for stochastic optimization problems with heavy-tailed noise with structured density.
We show that it is possible to obtain faster convergence rates than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise [59.25598762373543]
We establish high-probability convergence guarantees for learning on streaming data in the presence of heavy-tailed noise.
We demonstrate analytically and numerically how this guides the preferred choice of setting for a given problem.
arXiv Detail & Related papers (2023-10-28T18:53:41Z) - Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
Classical analysis of convex and non-convex optimization methods often requires Lipschitz continuity of the gradient, which limits the scope of the analysis.
Recent work generalizes this setting via a non-uniform smoothness condition.
arXiv Detail & Related papers (2023-06-02T04:21:59Z) - The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded
Gradients and Affine Variance [46.15915820243487]
We show that AdaGrad-Norm exhibits order-optimal convergence.
arXiv Detail & Related papers (2022-02-11T17:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.