Improved Analysis of Clipping Algorithms for Non-convex Optimization
- URL: http://arxiv.org/abs/2010.02519v2
- Date: Thu, 29 Oct 2020 03:04:32 GMT
- Title: Improved Analysis of Clipping Algorithms for Non-convex Optimization
- Authors: Bohang Zhang and Jikai Jin and Cong Fang and Liwei Wang
- Abstract summary: Recently, \citet{zhang2019gradient} show that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD.
Experiments confirm the superiority of clipping-based methods in deep learning tasks.
- Score: 19.507750439784605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient clipping is commonly used in training deep neural networks partly
due to its practicability in relieving the exploding gradient problem.
Recently, \citet{zhang2019gradient} show that clipped (stochastic) Gradient
Descent (GD) converges faster than vanilla GD/SGD by introducing a new
assumption called $(L_0, L_1)$-smoothness, which characterizes the violent
fluctuation of gradients typically encountered in deep neural networks.
However, their iteration complexities in terms of the problem-dependent parameters are
rather pessimistic, and theoretical justification of clipping combined with
other crucial techniques, e.g. momentum acceleration, is still lacking. In
this paper, we bridge the gap by presenting a general framework to study the
clipping algorithms, which also takes momentum methods into consideration. We
provide convergence analysis of the framework in both deterministic and
stochastic settings, and demonstrate the tightness of our results by comparing
them with existing lower bounds. Our results imply that the efficiency of
clipping methods will not degenerate even in highly non-smooth regions of the
landscape. Experiments confirm the superiority of clipping-based methods in
deep learning tasks.
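To make the kind of update covered by such a framework concrete, the following is a minimal sketch of clipped SGD with heavy-ball momentum in plain NumPy. It is not the exact algorithm analyzed in the paper: the learning rate, momentum coefficient, and clipping threshold are placeholder values, and the toy quartic objective is used only because its gradient grows rapidly away from the origin, mimicking the $(L_0, L_1)$-smooth regime where clipping is expected to help.

```python
import numpy as np

def clipped_sgd_momentum(grad_fn, x0, lr=0.05, beta=0.9, clip_norm=1.0, steps=200):
    """Clipped SGD with heavy-ball momentum (illustrative sketch only).

    grad_fn(x) returns a (possibly stochastic) gradient estimate at x.
    The momentum buffer is clipped by its global norm before each step,
    which is one common way of combining clipping with momentum.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)                               # momentum buffer
    for _ in range(steps):
        g = grad_fn(x)
        m = beta * m + (1.0 - beta) * g                # moving average of gradients
        scale = min(1.0, clip_norm / (np.linalg.norm(m) + 1e-12))
        x = x - lr * scale * m                         # norm-clipped update
    return x

# Toy quartic: the gradient 4x^3 blows up away from 0, so an unclipped step with a
# fixed learning rate can diverge, while the clipped step stays bounded.
print(clipped_sgd_momentum(lambda x: 4.0 * x**3, x0=np.array([3.0])))
```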
Related papers
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We also provide applications to fast-rate convergence in games and extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T) / T^{1 - \frac{1}{\alpha}}\big)$ (a sketch of the plain AdaGrad update appears after this list).
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - High Probability Analysis for Non-Convex Stochastic Optimization with
Clipping [13.025261730510847]
Gradient clipping is a technique for dealing with heavy-tailed noise when training neural networks.
Most existing theoretical guarantees only provide an in-expectation analysis of performance.
Our analysis provides a relatively complete picture for the theoretical guarantee of optimization algorithms with gradient clipping.
arXiv Detail & Related papers (2023-07-25T17:36:56Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
However, PINNs can suffer from training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Proxy Convexity: A Unified Framework for the Analysis of Neural Networks
Trained by Gradient Descent [95.94432031144716]
We propose a unified non-convex optimization framework for the analysis of neural network training.
We show that existing guarantees for networks trained by gradient descent can be unified through proxy convexity.
arXiv Detail & Related papers (2021-06-25T17:45:00Z) - Adaptive Learning Rate and Momentum for Training Deep Neural Networks [0.0]
We develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework.
Experiments on image classification datasets show that our method yields faster convergence than other local solvers.
arXiv Detail & Related papers (2021-06-22T05:06:56Z) - Stability and Convergence of Stochastic Gradient Clipping: Beyond
Lipschitz Continuity and Smoothness [23.22461721824713]
Gradient clipping is a technique to stabilize the training process for problems prone to the exploding gradient problem.
This paper establishes both qualitative and quantitative convergence results for the clipped stochastic (sub)gradient method (SGD) applied to non-smooth convex functions.
We also study the convergence of a clipped method with momentum, which includes clipped SGD as a special case.
arXiv Detail & Related papers (2021-02-12T12:41:42Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
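For the "Adaptive Federated Learning Over the Air" entry above, the analyzed algorithm builds on AdaGrad. As a reference point only, here is a minimal sketch of the plain, centralized AdaGrad update; the federated, over-the-air aggregation that the paper actually studies is not reproduced here, and the step size and toy objective are placeholder choices.

```python
import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=200):
    """Plain centralized AdaGrad update (reference sketch, not the federated variant)."""
    x = np.asarray(x0, dtype=float)
    acc = np.zeros_like(x)                         # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        acc += g * g
        x = x - lr * g / (np.sqrt(acc) + eps)      # per-coordinate adaptive step
    return x

# Toy quadratic: the gradient of ||x||^2 is 2x.
print(adagrad(lambda x: 2.0 * x, x0=np.array([5.0, -3.0])))
```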
This list is automatically generated from the titles and abstracts of the papers on this site.