Spherical Perspective on Learning with Normalization Layers
- URL: http://arxiv.org/abs/2006.13382v3
- Date: Thu, 19 May 2022 13:29:31 GMT
- Title: Spherical Perspective on Learning with Normalization Layers
- Authors: Simon Roburin, Yann de Mont-Marin, Andrei Bursuc, Renaud Marlet,
Patrick Pérez, Mathieu Aubry
- Abstract summary: Normalization Layers (NLs) are widely used in modern deep-learning architectures.
This paper introduces a spherical framework to study the optimization of neural networks with NLs from a geometric perspective.
- Score: 28.10737477667422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Normalization Layers (NLs) are widely used in modern deep-learning
architectures. Despite their apparent simplicity, their effect on optimization
is not yet fully understood. This paper introduces a spherical framework to
study the optimization of neural networks with NLs from a geometric
perspective. Concretely, the radial invariance of groups of parameters, such as
filters for convolutional neural networks, makes it possible to translate the
optimization steps onto the $L_2$ unit hypersphere. This formulation and the
associated geometric interpretation shed new light on the training dynamics.
First, the first expression of the effective learning rate of Adam is derived.
Then, it follows from the framework that, in the presence of NLs, performing
Stochastic Gradient Descent (SGD) alone is actually equivalent to a variant of
Adam constrained to the unit hypersphere. Finally, the analysis outlines
phenomena that previous variants of Adam act on, and their importance in the
optimization process is experimentally validated.
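As a rough illustration of the radial invariance described above, here is a minimal Python/NumPy sketch (a toy loss that depends only on the direction of a parameter group, not the authors' code): the gradient is orthogonal to the parameters, so an SGD step mostly rotates the group, and its motion can be read on the $L_2$ unit hypersphere with an effective step that shrinks as the radius grows. The exact effective learning rate expressions for SGD and Adam are derived in the paper; the scaling shown here is only a hedged reading of that analysis.

import numpy as np

# Minimal sketch, not the paper's code: a toy radially-invariant loss,
# i.e. L(lambda * w) = L(w), mimicking a filter followed by a normalization layer.
def loss(w, target):
    u = w / np.linalg.norm(w)              # only the direction of w matters
    return 0.5 * np.sum((u - target) ** 2)

def num_grad(w, target, eps=1e-6):
    # central-difference gradient, accurate enough for illustration
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e, target) - loss(w - e, target)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
w = rng.normal(size=5)
target = rng.normal(size=5)
target /= np.linalg.norm(target)

g = num_grad(w, target)
print("g . w (radial invariance => ~0):", float(g @ w))

lr = 0.1
for _ in range(200):
    w = w - lr * num_grad(w, target)       # plain SGD step on w

# The same trajectory can be read on the unit hypersphere: the direction
# u = w / ||w|| is what actually moves, with an effective step that scales
# roughly like lr / ||w||**2 as the radius ||w|| grows.
u = w / np.linalg.norm(w)
print("cosine(u, target):", float(u @ target))
print("||w||:", float(np.linalg.norm(w)), " approx effective lr:", lr / float(np.linalg.norm(w)) ** 2)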
Related papers
- Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms the common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate (a minimal IGD update sketch appears after this list).
arXiv Detail & Related papers (2024-07-03T06:10:41Z) - The Convex Landscape of Neural Networks: Characterizing Global Optima
and Stationary Points via Lasso Models [75.33431791218302]
Training Deep Neural Network (DNN) models is a non-convex optimization problem.
In this paper we examine the use of convex neural recovery models.
We show that all stationary points of the non-convex objective can be characterized as the global optimum of a subsampled convex program.
arXiv Detail & Related papers (2023-12-19T23:04:56Z) - A Unified Algebraic Perspective on Lipschitz Neural Networks [88.14073994459586]
This paper introduces a novel perspective unifying various types of 1-Lipschitz neural networks.
We show that many existing techniques can be derived and generalized via finding analytical solutions of a common semidefinite programming (SDP) condition.
Our approach, called SDP-based Lipschitz Layers (SLL), allows us to design non-trivial yet efficient generalizations of convex potential layers.
arXiv Detail & Related papers (2023-03-06T14:31:09Z) - How Does Adaptive Optimization Impact Local Neural Network Geometry? [32.32593743852949]
We argue that in the context of neural network optimization, this traditional viewpoint is insufficient.
We show that adaptive methods such as Adam bias the trajectories towards regions where one might expect faster convergence.
arXiv Detail & Related papers (2022-11-04T04:05:57Z) - Training Scale-Invariant Neural Networks on the Sphere Can Happen in
Three Regimes [3.808063547958558]
We study the properties of training scale-invariant neural networks directly on the sphere using a fixed effective learning rate (ELR).
We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence.
arXiv Detail & Related papers (2022-09-08T10:30:05Z) - Understanding the Generalization of Adam in Learning Neural Networks
with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We also show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to Non-uniform Smoothness (NS) and Non-uniform Lojasiewicz inequality (NL).
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $\Omega(1/t^2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Training Sparse Neural Network by Constraining Synaptic Weight on Unit
Lp Sphere [2.429910016019183]
Constraining the synaptic weights to the unit Lp-sphere enables flexible control of the sparsity through the choice of p (a minimal sketch of this constraint appears after this list).
Our approach is validated by experiments on benchmark datasets covering a wide range of domains.
arXiv Detail & Related papers (2021-03-30T01:02:31Z) - Improving the Backpropagation Algorithm with Consequentialism Weight
Updates over Mini-Batches [0.40611352512781856]
We show that it is possible to consider a multi-layer neural network as a stack of adaptive filters.
We introduce a better algorithm by predicting, and then amending, the adverse consequences of the actions that take place in BP, even before they happen.
Our experiments show the usefulness of our algorithm in the training of deep neural networks.
arXiv Detail & Related papers (2020-03-11T08:45:36Z) - Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of
DNNs [115.35745188028169]
We extend conditioning analysis to deep neural networks (DNNs) in order to investigate their learning dynamics.
We show that batch normalization (BN) can stabilize the training, but sometimes result in the false impression of a local minimum.
We experimentally observe that BN can improve the layer-wise conditioning of the optimization problem.
arXiv Detail & Related papers (2020-02-25T11:40:27Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the OptimisticOA algorithm for non-concave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
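As a forward-referenced aside for the implicit gradient descent (IGD) entry above, here is a minimal NumPy sketch of a generic implicit (backward-Euler) gradient step on a toy stiff quadratic, standing in for the multi-scale behaviour mentioned there. This is not the paper's physics-informed setup; the loss, matrix A, and step size are illustrative assumptions.

import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta with two curvature scales.
A = np.diag([1.0, 100.0])

def grad(theta):
    return A @ theta

def gd_step(theta, lr):
    return theta - lr * grad(theta)      # explicit (common) gradient descent

def igd_step(theta, lr):
    # implicit update theta_next = theta - lr * grad(theta_next);
    # for a quadratic loss this is a linear solve, for general losses a
    # root-finder would be needed
    return np.linalg.solve(np.eye(len(theta)) + lr * A, theta)

lr = 0.05                                # above the 2/100 stability limit of explicit GD
theta0 = np.array([1.0, 1.0])
theta_gd, theta_igd = theta0.copy(), theta0.copy()
for _ in range(100):
    theta_gd = gd_step(theta_gd, lr)     # blows up along the stiff direction
    theta_igd = igd_step(theta_igd, lr)  # remains stable and converges

print("GD  ||theta||:", np.linalg.norm(theta_gd))
print("IGD ||theta||:", np.linalg.norm(theta_igd))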
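And for the unit Lp-sphere entry above, a minimal sketch of the constraint mechanism: a simple rescaling retraction onto ||w||_p = 1 after each gradient step, applied to a toy least-squares problem. The paper's exact projection and training procedure may differ; the data, learning rate, and threshold below are illustrative assumptions.

import numpy as np

def lp_norm(w, p):
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

def to_unit_lp_sphere(w, p):
    # rescale so that ||w||_p == 1 (a simple retraction onto the constraint set)
    return w / lp_norm(w, p)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))
true_w = np.zeros(10)
true_w[:3] = 1.0
y = x @ true_w

for p in (1.0, 2.0):
    w = to_unit_lp_sphere(rng.normal(size=10), p)
    lr = 0.01
    for _ in range(500):
        g = x.T @ (x @ w - y) / len(y)        # least-squares gradient
        w = to_unit_lp_sphere(w - lr * g, p)  # keep w on the unit Lp sphere
    # report how concentrated the constrained solution is
    print(f"p={p}: coordinates with |w_i| > 0.05 ->", int(np.sum(np.abs(w) > 0.05)))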
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.