Towards Accelerating Training of Batch Normalization: A Manifold
Perspective
- URL: http://arxiv.org/abs/2101.02916v1
- Date: Fri, 8 Jan 2021 08:53:07 GMT
- Title: Towards Accelerating Training of Batch Normalization: A Manifold
Perspective
- Authors: Mingyang Yi, Qi Meng, Wei Chen, Zhi-Ming Ma
- Abstract summary: Batch normalization (BN) has become a crucial component across diverse deep neural networks.
We propose a quotient manifold, the PSI manifold, in which all equivalent weights of a network with BN are regarded as a single element.
We show that our algorithms consistently achieve better performance across various experimental settings.
- Score: 19.55158964644964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Batch normalization (BN) has become a crucial component across diverse deep
neural networks. A network with BN is invariant to positive linear re-scaling of its
weights, so there exist infinitely many functionally equivalent networks with
different weight scales. However, optimizing these equivalent networks with a
first-order method such as stochastic gradient descent converges to different local
optima owing to different gradients across training. To alleviate this, we propose a
quotient manifold, the \emph{PSI manifold}, in which all equivalent weights of a
network with BN are regarded as a single element. We then construct gradient descent
and stochastic gradient descent on the PSI manifold. The two algorithms guarantee
that every group of equivalent weights (related by positive re-scaling) converges to
equivalent optima. In addition, we give the convergence rates of the proposed
algorithms on the PSI manifold and show that they accelerate training compared with
their counterparts on the Euclidean weight space. Empirical studies show that our
algorithms consistently achieve better performance across various experimental
settings.
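To make the scale-invariance concrete, the Python sketch below first checks numerically that a linear layer followed by training-mode batch normalization is unchanged when its weights are multiplied by a positive constant, and then shows one simple way to optimize over a single representative of each equivalence class: SGD restricted to the unit sphere. The sphere update is only an illustrative quotient-style step under our own assumptions, not the paper's exact PSI-manifold algorithm.

```python
import numpy as np

def bn_linear(x, W, eps=1e-5):
    """Linear layer followed by training-mode batch normalization (gamma=1, beta=0)."""
    z = x @ W.T                                    # pre-activations, shape (batch, out)
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))
W = rng.normal(size=(5, 10))

# Positive re-scaling of the weights leaves the BN output (essentially) unchanged.
print(np.allclose(bn_linear(x, W), bn_linear(x, 3.7 * W)))   # True, up to BN's eps

def sphere_sgd_step(w, grad, lr):
    """One SGD step on the unit sphere: pick the unit-norm representative of the
    positive-scaling equivalence class, move along the tangential part of the
    gradient, and retract back onto the sphere. Illustrative only."""
    w = w / np.linalg.norm(w)
    grad_tan = grad - (grad @ w) * w               # remove the radial (pure-scale) direction
    w_new = w - lr * grad_tan
    return w_new / np.linalg.norm(w_new)
```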
Related papers
- Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
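As a rough illustration of the idea (not the paper's exact formulation), the sketch below implements a generic implicit, i.e. backward-Euler, gradient step in which the gradient is evaluated at the next iterate and the resulting equation is solved by fixed-point iteration; the step size, inner iteration count, and quadratic toy example are our own assumptions.

```python
import numpy as np

def implicit_gd_step(grad_fn, theta, lr=0.5, inner_iters=30):
    """One implicit (backward-Euler) gradient step,
        theta_next = theta - lr * grad_fn(theta_next),
    solved approximately by fixed-point iteration."""
    theta_next = theta.copy()
    for _ in range(inner_iters):
        theta_next = theta - lr * grad_fn(theta_next)
    return theta_next

# Toy example: L(theta) = 0.5 * ||theta||^2, so grad_fn(theta) = theta.
theta = np.array([1.0, -2.0])
theta = implicit_gd_step(lambda t: t, theta)       # converges to theta / (1 + lr)
```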
arXiv Detail & Related papers (2024-07-03T06:10:41Z)
- Concurrent Training and Layer Pruning of Deep Neural Networks [0.0]
We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training.
We employ a structure with residual connections around nonlinear network sections, which allows information to keep flowing through the network once a nonlinear section is pruned.
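The hypothetical PyTorch sketch below illustrates the general idea of wrapping a nonlinear section in a residual connection so that removing (pruning) the section reduces it to an identity map and information still flows; the class name, layer sizes, and the `active` flag are illustrative choices of ours, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PrunableResidualSection(nn.Module):
    """A nonlinear section wrapped in a skip connection; if pruned, the skip
    path alone carries the signal so the surrounding network keeps working."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.active = True                 # set to False to "prune" the section

    def forward(self, x):
        return x + self.body(x) if self.active else x

block = PrunableResidualSection(dim=16, hidden=32)
x = torch.randn(4, 16)
y_full = block(x)
block.active = False
y_pruned = block(x)                        # identity mapping: information still flows
```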
arXiv Detail & Related papers (2024-06-06T23:19:57Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T) / T^{1 - \frac{1}{\alpha}}\big)$.
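As a rough sketch of the setting rather than the paper's algorithm, the following Python function performs one server round of a federated AdaGrad-style update, with over-the-air aggregation modeled crudely as client averaging plus additive Gaussian channel noise; all names and the noise model are our own assumptions.

```python
import numpy as np

def federated_adagrad_round(theta, client_grads, accum, lr=0.1, eps=1e-8,
                            noise_std=0.0, rng=None):
    """One server round: aggregate client gradients (noisy over-the-air sum),
    update the AdaGrad accumulator, and take an adaptive step."""
    rng = rng or np.random.default_rng(0)
    g = np.mean(client_grads, axis=0)                 # aggregate client gradients
    g = g + noise_std * rng.normal(size=g.shape)      # additive channel noise
    accum = accum + g ** 2                            # AdaGrad second-moment accumulator
    theta = theta - lr * g / (np.sqrt(accum) + eps)
    return theta, accum
```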
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
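A minimal sketch of a lookahead-style linear-interpolation update is shown below, assuming a plain gradient inner step and an interpolation weight `lam`; it conveys the general stabilization idea rather than the paper's exact scheme.

```python
import numpy as np

def interpolated_step(theta, grad_fn, lr=0.1, lam=0.5, inner_steps=1):
    """Take one or more plain gradient steps, then move only a fraction lam
    of the way toward the result (linear interpolation of iterates)."""
    theta_inner = theta.copy()
    for _ in range(inner_steps):
        theta_inner = theta_inner - lr * grad_fn(theta_inner)
    return (1.0 - lam) * theta + lam * theta_inner
```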
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Equivariant Deep Weight Space Alignment [54.65847470115314]
We propose a novel framework aimed at learning to solve the weight alignment problem.
We first prove that weight alignment adheres to two fundamental symmetries and then propose a deep architecture that respects these symmetries.
arXiv Detail & Related papers (2023-10-20T10:12:06Z)
- On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks [21.69222364939501]
In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero.
We show that the path interpolates continuously between the so-called kernel and rich regimes.
arXiv Detail & Related papers (2023-03-31T05:32:11Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective at solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- AskewSGD: An Annealed interval-constrained Optimisation method to train Quantized Neural Networks [12.229154524476405]
We develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights.
Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set.
Experimental results show that the AskewSGD algorithm performs better than or on par with state-of-the-art methods on classical benchmarks.
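For background on quantized-weight training in general, the sketch below shows one step of a standard straight-through-estimator (STE) baseline with binary weights; this is a different, generic technique included only for illustration and is not the AskewSGD algorithm.

```python
import numpy as np

def ste_quantized_step(w, grad_fn, lr=0.05):
    """Forward with binary-quantized weights, then apply the gradient to the
    latent full-precision weights (straight-through estimator)."""
    w_q = np.sign(w)          # quantized weights used in the forward/backward pass
    g = grad_fn(w_q)          # gradient of the loss evaluated at the quantized point
    return w - lr * g         # STE: update the latent full-precision weights
```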
arXiv Detail & Related papers (2022-11-07T18:13:44Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
The deep equilibrium (DEQ) model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings: computing the equilibrium of the network and optimizing over its inputs.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
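The Python sketch below computes a deep-equilibrium-style output by naive fixed-point iteration on a contractive toy layer; practical DEQ implementations rely on faster root-finding solvers (e.g. Anderson or Broyden iterations), so this is purely an illustration of the fixed-point idea.

```python
import numpy as np

def deq_forward(f, x, z0, n_iters=100, tol=1e-6):
    """Find z* with z* = f(z*, x) by simple fixed-point iteration."""
    z = z0
    for _ in range(n_iters):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Toy contractive layer: f(z, x) = 0.5 * tanh(z) + x.
x = np.array([0.3, -0.1])
z_star = deq_forward(lambda z, x: 0.5 * np.tanh(z) + x, x, z0=np.zeros(2))
```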
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Improving the Backpropagation Algorithm with Consequentialism Weight Updates over Mini-Batches [0.40611352512781856]
We show that it is possible to consider a multi-layer neural network as a stack of adaptive filters.
We introduce a better algorithm by predicting and then emending the adverse consequences of the actions that take place in BP, even before they happen.
Our experiments show the usefulness of our algorithm in the training of deep neural networks.
arXiv Detail & Related papers (2020-03-11T08:45:36Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
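For reference, the sketch below shows a plain (non-adaptive) optimistic gradient update of the form theta <- theta - lr * (2*g_t - g_{t-1}), a standard stabilization device for min-max (e.g. GAN) training; the adaptive variant analyzed in the paper builds on this idea but is not reproduced here.

```python
import numpy as np

def optimistic_step(theta, grad, prev_grad, lr=0.01):
    """One optimistic gradient step: extrapolate with the previous gradient
    to damp the rotational dynamics typical of min-max training."""
    return theta - lr * (2.0 * grad - prev_grad)
```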
arXiv Detail & Related papers (2019-12-26T22:10:10Z)