Towards Accelerating Training of Batch Normalization: A Manifold
Perspective
- URL: http://arxiv.org/abs/2101.02916v1
- Date: Fri, 8 Jan 2021 08:53:07 GMT
- Title: Towards Accelerating Training of Batch Normalization: A Manifold
Perspective
- Authors: Mingyang Yi, Qi Meng, Wei Chen, Zhi-Ming Ma
- Abstract summary: Batch normalization (BN) has become a crucial component across diverse deep neural networks.
We propose a quotient manifold, the PSI manifold, in which all equivalent weights of a network with BN are regarded as a single element.
We show that our algorithms consistently achieve better performance across various experimental settings.
- Score: 19.55158964644964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Batch normalization (BN) has become a crucial component across diverse deep
neural networks. A network with BN is invariant to positive linear re-scaling of its
weights, so there exist infinitely many functionally equivalent networks with
different weight scales. However, optimizing these equivalent networks with a
first-order method such as stochastic gradient descent converges to different local
optima owing to different gradients across training. To alleviate this, we propose a
quotient manifold, the \emph{PSI manifold}, in which all equivalent weights of a
network with BN are regarded as a single element. We then construct gradient descent
and stochastic gradient descent on the PSI manifold. The two algorithms guarantee
that every group of equivalent weights (related by positive re-scaling) converges to
equivalent optima. In addition, we give the convergence rates of the proposed
algorithms on the PSI manifold and show that they accelerate training compared with
their counterparts on the Euclidean weight space. Empirical studies show that our
algorithms consistently achieve better performance across various experimental
settings.
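To make the scale-invariance concrete, the Python sketch below first checks numerically that a linear layer followed by training-mode batch normalization is unchanged when its weights are multiplied by a positive constant, and then shows one simple way to optimize over a single representative of each equivalence class: SGD restricted to the unit sphere. The sphere update is only an illustrative quotient-style step under our own assumptions, not the paper's exact PSI-manifold algorithm.

```python
import numpy as np

def bn_linear(x, W, eps=1e-5):
    """Linear layer followed by training-mode batch normalization (gamma=1, beta=0)."""
    z = x @ W.T                                    # pre-activations, shape (batch, out)
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))
W = rng.normal(size=(5, 10))

# Positive re-scaling of the weights leaves the BN output (essentially) unchanged.
print(np.allclose(bn_linear(x, W), bn_linear(x, 3.7 * W)))   # True, up to BN's eps

def sphere_sgd_step(w, grad, lr):
    """One SGD step on the unit sphere: pick the unit-norm representative of the
    positive-scaling equivalence class, move along the tangential part of the
    gradient, and retract back onto the sphere. Illustrative only."""
    w = w / np.linalg.norm(w)
    grad_tan = grad - (grad @ w) * w               # remove the radial (pure-scale) direction
    w_new = w - lr * grad_tan
    return w_new / np.linalg.norm(w_new)
```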
Related papers
- Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
Implicit gradient descent (IGD) outperforms common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
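As a rough illustration of the idea (not the paper's exact formulation), the sketch below implements a generic implicit, i.e. backward-Euler, gradient step in which the gradient is evaluated at the next iterate and the resulting equation is solved by fixed-point iteration; the step size, inner iteration count, and quadratic toy example are our own assumptions.

```python
import numpy as np

def implicit_gd_step(grad_fn, theta, lr=0.5, inner_iters=30):
    """One implicit (backward-Euler) gradient step,
        theta_next = theta - lr * grad_fn(theta_next),
    solved approximately by fixed-point iteration."""
    theta_next = theta.copy()
    for _ in range(inner_iters):
        theta_next = theta - lr * grad_fn(theta_next)
    return theta_next

# Toy example: L(theta) = 0.5 * ||theta||^2, so grad_fn(theta) = theta.
theta = np.array([1.0, -2.0])
theta = implicit_gd_step(lambda t: t, theta)       # converges to theta / (1 + lr)
```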
arXiv Detail & Related papers (2024-07-03T06:10:41Z)
- Concurrent Training and Layer Pruning of Deep Neural Networks [0.0]
We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training.
We employ a structure with residual connections around nonlinear network sections, which allows information to keep flowing through the network once a nonlinear section is pruned.
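The hypothetical PyTorch sketch below illustrates the general idea of wrapping a nonlinear section in a residual connection so that removing (pruning) the section reduces it to an identity map and information still flows; the class name, layer sizes, and the `active` flag are illustrative choices of ours, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PrunableResidualSection(nn.Module):
    """A nonlinear section wrapped in a skip connection; if pruned, the skip
    path alone carries the signal so the surrounding network keeps working."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.active = True                 # set to False to "prune" the section

    def forward(self, x):
        return x + self.body(x) if self.active else x

block = PrunableResidualSection(dim=16, hidden=32)
x = torch.randn(4, 16)
y_full = block(x)
block.active = False
y_pruned = block(x)                        # identity mapping: information still flows
```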
arXiv Detail & Related papers (2024-06-06T23:19:57Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\big(\ln(T) / T^{1 - \frac{1}{\alpha}}\big)$.
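As a rough sketch of the setting rather than the paper's algorithm, the following Python function performs one server round of a federated AdaGrad-style update, with over-the-air aggregation modeled crudely as client averaging plus additive Gaussian channel noise; all names and the noise model are our own assumptions.

```python
import numpy as np

def federated_adagrad_round(theta, client_grads, accum, lr=0.1, eps=1e-8,
                            noise_std=0.0, rng=None):
    """One server round: aggregate client gradients (noisy over-the-air sum),
    update the AdaGrad accumulator, and take an adaptive step."""
    rng = rng or np.random.default_rng(0)
    g = np.mean(client_grads, axis=0)                 # aggregate client gradients
    g = g + noise_std * rng.normal(size=g.shape)      # additive channel noise
    accum = accum + g ** 2                            # AdaGrad second-moment accumulator
    theta = theta - lr * g / (np.sqrt(accum) + eps)
    return theta, accum
```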
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
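A minimal sketch of a lookahead-style linear-interpolation update is shown below, assuming a plain gradient inner step and an interpolation weight `lam`; it conveys the general stabilization idea rather than the paper's exact scheme.

```python
import numpy as np

def interpolated_step(theta, grad_fn, lr=0.1, lam=0.5, inner_steps=1):
    """Take one or more plain gradient steps, then move only a fraction lam
    of the way toward the result (linear interpolation of iterates)."""
    theta_inner = theta.copy()
    for _ in range(inner_steps):
        theta_inner = theta_inner - lr * grad_fn(theta_inner)
    return (1.0 - lam) * theta + lam * theta_inner
```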
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Equivariant Deep Weight Space Alignment [54.65847470115314]
We propose a novel framework aimed at learning to solve the weight alignment problem.
We first prove that weight alignment adheres to two fundamental symmetries and then propose a deep architecture that respects these symmetries.
arXiv Detail & Related papers (2023-10-20T10:12:06Z)
- On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks [21.69222364939501]
In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero.
We show that the path interpolates continuously between the so-called kernel and rich regimes.
arXiv Detail & Related papers (2023-03-31T05:32:11Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective at solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- AskewSGD: An Annealed interval-constrained Optimisation method to train Quantized Neural Networks [12.229154524476405]
We develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights.
Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set.
Experimental results show that the AskewSGD algorithm performs better than or on par with state-of-the-art methods on classical benchmarks.
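For background on quantized-weight training in general, the sketch below shows one step of a standard straight-through-estimator (STE) baseline with binary weights; this is a different, generic technique included only for illustration and is not the AskewSGD algorithm.

```python
import numpy as np

def ste_quantized_step(w, grad_fn, lr=0.05):
    """Forward with binary-quantized weights, then apply the gradient to the
    latent full-precision weights (straight-through estimator)."""
    w_q = np.sign(w)          # quantized weights used in the forward/backward pass
    g = grad_fn(w_q)          # gradient of the loss evaluated at the quantized point
    return w - lr * g         # STE: update the latent full-precision weights
```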
arXiv Detail & Related papers (2022-11-07T18:13:44Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
The deep equilibrium (DEQ) model is a class of models that forgoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings: computing the equilibrium of the network and optimizing over its inputs.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
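The Python sketch below computes a deep-equilibrium-style output by naive fixed-point iteration on a contractive toy layer; practical DEQ implementations rely on faster root-finding solvers (e.g. Anderson or Broyden iterations), so this is purely an illustration of the fixed-point idea.

```python
import numpy as np

def deq_forward(f, x, z0, n_iters=100, tol=1e-6):
    """Find z* with z* = f(z*, x) by simple fixed-point iteration."""
    z = z0
    for _ in range(n_iters):
        z_next = f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Toy contractive layer: f(z, x) = 0.5 * tanh(z) + x.
x = np.array([0.3, -0.1])
z_star = deq_forward(lambda z, x: 0.5 * np.tanh(z) + x, x, z0=np.zeros(2))
```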
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Improving the Backpropagation Algorithm with Consequentialism Weight Updates over Mini-Batches [0.40611352512781856]
We show that it is possible to consider a multi-layer neural network as a stack of adaptive filters.
We introduce a better algorithm by predicting and then emending the adverse consequences of the actions that take place in BP, even before they happen.
Our experiments show the usefulness of our algorithm in the training of deep neural networks.
arXiv Detail & Related papers (2020-03-11T08:45:36Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
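For reference, the sketch below shows a plain (non-adaptive) optimistic gradient update of the form theta <- theta - lr * (2*g_t - g_{t-1}), a standard stabilization device for min-max (e.g. GAN) training; the adaptive variant analyzed in the paper builds on this idea but is not reproduced here.

```python
import numpy as np

def optimistic_step(theta, grad, prev_grad, lr=0.01):
    """One optimistic gradient step: extrapolate with the previous gradient
    to damp the rotational dynamics typical of min-max training."""
    return theta - lr * (2.0 * grad - prev_grad)
```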
arXiv Detail & Related papers (2019-12-26T22:10:10Z)