The Geometry of Sign Gradient Descent
- URL: http://arxiv.org/abs/2002.08056v1
- Date: Wed, 19 Feb 2020 08:45:54 GMT
- Title: The Geometry of Sign Gradient Descent
- Authors: Lukas Balles and Fabian Pedregosa and Nicolas Le Roux
- Abstract summary: We show a close connection between separable smoothness and $\ell_\infty$-smoothness and argue that the latter is the weaker and more natural assumption.
We then proceed to study the smoothness constant with respect to the $\ell_\infty$-norm and thereby isolate geometric properties of the objective function.
In short, we find sign-based methods to be preferable over gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue.
- Score: 29.8753797565422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign-based optimization methods have become popular in machine learning due
to their favorable communication cost in distributed optimization and their
surprisingly good performance in neural network training. Furthermore, they are
closely connected to so-called adaptive gradient methods like Adam. Recent
works on signSGD have used a non-standard "separable smoothness" assumption,
whereas some older works study sign gradient descent as steepest descent with
respect to the $\ell_\infty$-norm. In this work, we unify these existing
results by showing a close connection between separable smoothness and
$\ell_\infty$-smoothness and argue that the latter is the weaker and more
natural assumption. We then proceed to study the smoothness constant with
respect to the $\ell_\infty$-norm and thereby isolate geometric properties of
the objective function which affect the performance of sign-based methods. In
short, we find sign-based methods to be preferable over gradient descent if (i)
the Hessian is to some degree concentrated on its diagonal, and (ii) its
maximal eigenvalue is much larger than the average eigenvalue. Both properties
are common in deep networks.
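For illustration, here is a minimal NumPy sketch (not the authors' code): the update $x \leftarrow x - \eta\,\mathrm{sign}(\nabla f(x))$ is steepest descent with respect to the $\ell_\infty$-norm, and the two geometric conditions above can be probed with simple Hessian statistics. The diagnostics below are illustrative proxies, not the exact smoothness constants derived in the paper.

```python
# Minimal illustrative sketch: sign gradient descent as l_inf steepest descent,
# plus rough proxies for conditions (i) and (ii) from the abstract.
import numpy as np

def sign_gd_step(x, grad, lr):
    """Sign gradient descent: x <- x - lr * sign(grad f(x)).
    This is steepest descent w.r.t. the l_inf norm, since the minimizer of
    <g, d> over ||d||_inf <= 1 is d = -sign(g)."""
    return x - lr * np.sign(grad(x))

def gd_step(x, grad, lr):
    """Ordinary gradient descent (steepest descent w.r.t. the l_2 norm)."""
    return x - lr * grad(x)

def hessian_diagnostics(H):
    """Rough proxies (not the paper's exact constants) for:
    (i) how concentrated the Hessian is on its diagonal, and
    (ii) how much the largest eigenvalue exceeds the average eigenvalue."""
    H = np.asarray(H, dtype=float)
    H = (H + H.T) / 2
    diag_share = np.abs(np.diag(H)).sum() / np.abs(H).sum()
    eigs = np.linalg.eigvalsh(H)
    return diag_share, eigs.max() / eigs.mean()

# Example: an ill-conditioned, diagonal quadratic f(x) = 0.5 * x^T H x --
# the regime in which the abstract suggests sign-based methods help.
rng = np.random.default_rng(0)
d = 20
H = np.diag(np.logspace(-1, 2, d))   # eigenvalues spread over three decades
grad = lambda x: H @ x

x_sign = x_gd = rng.standard_normal(d)
for _ in range(200):
    x_sign = sign_gd_step(x_sign, grad, lr=1e-2)
    x_gd = gd_step(x_gd, grad, lr=1e-2)

diag_share, eig_ratio = hessian_diagnostics(H)
print(f"diagonal share of |H|: {diag_share:.2f}")
print(f"lambda_max / mean(lambda): {eig_ratio:.1f}")
print(f"f(x) after 200 sign-GD steps: {0.5 * x_sign @ H @ x_sign:.4f}")
print(f"f(x) after 200 GD steps:      {0.5 * x_gd @ H @ x_gd:.4f}")
```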
Related papers
- Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity [50.25258834153574]
We focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods.
In particular, we derive improved convergence rates for Gradient Descent with Smoothed Gradient Clipping and for Gradient Descent with Polyak Stepsizes.
arXiv Detail & Related papers (2024-09-23T13:11:37Z) - Neural Gradient Learning and Optimization for Oriented Point Normal Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability in local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$. (A minimal sketch of the Adam update and its elementwise connection to sign descent appears after this list.)
arXiv Detail & Related papers (2023-04-27T06:27:37Z) - Variance-reduced Clipping for Non-convex Optimization [24.765794811146144]
Gradient clipping is a technique used in deep learning applications such as large-scale language modeling.
Recent experimental studies have observed a fairly special smoothness behavior of the training objective along its trajectory, which can be exploited to improve the order of the gradient complexity.
arXiv Detail & Related papers (2023-03-02T00:57:38Z) - Towards More Robust Interpretation via Local Gradient Alignment [37.464250451280336]
We show that for every non-negative homogeneous neural network, a naive $\ell_2$-robust criterion for gradients is not normalization invariant.
We propose to combine both $\ell_2$ and cosine distance-based criteria as regularization terms to leverage the advantages of both in aligning the local gradient.
We experimentally show that models trained with our method produce much more robust interpretations on CIFAR-10 and ImageNet-100.
arXiv Detail & Related papers (2022-11-29T03:38:28Z) - How Does Adaptive Optimization Impact Local Neural Network Geometry? [32.32593743852949]
We argue that in the context of neural network optimization, this traditional viewpoint is insufficient.
We show that adaptive methods such as Adam bias the trajectories towards regions where one might expect faster convergence.
arXiv Detail & Related papers (2022-11-04T04:05:57Z) - Fast Gradient Non-sign Methods [67.56549792690706]
Fast Gradient Non-sign Method (FGNM) is a general routine, which can seamlessly replace the conventional $\mathrm{sign}$ operation in gradient-based attacks.
Our methods outperform their sign-based counterparts by 27.5% at most and 9.5% on average.
arXiv Detail & Related papers (2021-10-25T08:46:00Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and the Non-uniform Łojasiewicz inequality (NŁ).
The new definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. and trained with gradient descent are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z) - Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of an optimistic adaptive gradient algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
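Several of the related papers above analyze Adam, and the abstract notes that sign-based methods are closely connected to such adaptive methods. The following minimal sketch (illustrative, not taken from any of the papers listed) writes out the standard Adam update and checks that, elementwise, setting $\beta_1 = \beta_2 = 0$ with $\epsilon = 0$ reduces the step direction to $\mathrm{sign}(g)$.

```python
# Minimal sketch of the standard Adam update and its connection to sign descent:
# with beta1 = beta2 = 0 and eps = 0, the step m_hat / sqrt(v_hat) equals
# g / |g| = sign(g) elementwise (for nonzero gradient entries).
import numpy as np

def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update on parameters x given gradient g."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
g = rng.standard_normal(5)

# beta1 = beta2 = 0 and eps = 0 make the Adam step identical to a sign step.
x_adam, _, _ = adam_step(x, g, m=np.zeros(5), v=np.zeros(5), t=1,
                         lr=0.1, beta1=0.0, beta2=0.0, eps=0.0)
x_sign = x - 0.1 * np.sign(g)
print(np.allclose(x_adam, x_sign))  # True
```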