The Implicit Biases of Stochastic Gradient Descent on Deep Neural
Networks with Batch Normalization
- URL: http://arxiv.org/abs/2102.03497v1
- Date: Sat, 6 Feb 2021 03:40:20 GMT
- Title: The Implicit Biases of Stochastic Gradient Descent on Deep Neural
Networks with Batch Normalization
- Authors: Ziquan Liu, Yufei Cui, Jia Wan, Yu Mao, Antoni B. Chan
- Abstract summary: Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling due to their normalization operations.
We investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay.
- Score: 44.30960913470372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks with batch normalization (BN-DNNs) are invariant to
weight rescaling due to their normalization operations. However, using weight
decay (WD) benefits these weight-scale-invariant networks, which is often
attributed to an increase of the effective learning rate when the weight norms
are decreased. In this paper, we demonstrate the insufficiency of the previous
explanation and investigate the implicit biases of stochastic gradient descent
(SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of
weight decay. We identify two implicit biases of SGD on BN-DNNs: 1) the weight
norms in SGD training remain constant in the continuous-time domain and keep
increasing in the discrete-time domain; 2) SGD optimizes weight vectors in
fully-connected networks or convolution kernels in convolution neural networks
by updating components lying in the input feature span, while leaving those
components orthogonal to the input feature span unchanged. Thus, SGD without WD
accumulates weight noise orthogonal to the input feature span, and cannot
eliminate such noise. Our empirical studies corroborate the hypothesis that
weight decay suppresses weight noise that is left untouched by SGD.
Furthermore, we propose to use weight rescaling (WRS) instead of weight decay
to achieve the same regularization effect, while avoiding performance
degradation of WD on some momentum-based optimizers. Our empirical results on
image recognition show that regardless of optimization methods and network
architectures, training BN-DNNs using WRS achieves similar or better
performance compared with using WD. We also show that training with WRS
generalizes better than WD on other computer vision tasks.
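To make the second implicit bias and the proposed remedy concrete, here is a minimal NumPy sketch (not code from the paper): for a linear layer, the SGD gradient of each weight row is a multiple of the input, so the weight component orthogonal to the span of the training inputs is never updated; an illustrative weight-rescaling (WRS) step then renormalizes the weights in place of weight decay. The subspace dimension, learning rate, and target norm below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer y = W @ x with squared loss: for any loss, dL/dW = (dL/dy) x^T,
# so every row update is a multiple of the input x and lies in the input feature span.
d_in, d_out, d_span = 8, 3, 4
W = rng.normal(size=(d_out, d_in))

# Training inputs confined to a d_span-dimensional subspace of R^d_in.
basis = np.linalg.qr(rng.normal(size=(d_in, d_span)))[0]   # orthonormal (d_in, d_span)
X = basis @ rng.normal(size=(d_span, 32))                  # 32 inputs inside the span
T = rng.normal(size=(d_out, 32))                           # random regression targets

# Projector onto the orthogonal complement of the input span.
P_orth = np.eye(d_in) - basis @ basis.T
orth_before = W @ P_orth                                   # component SGD cannot touch

lr = 0.05
for _ in range(200):
    i = rng.integers(X.shape[1])
    x, t = X[:, i:i+1], T[:, i:i+1]
    grad = (W @ x - t) @ x.T                               # rank-1, lies in the input span
    W -= lr * grad

orth_after = W @ P_orth
print("max change in orthogonal component:", np.abs(orth_after - orth_before).max())  # ~0

# Illustrative weight-rescaling (WRS) step in the spirit of the paper: periodically
# rescale weight norms to a fixed value instead of applying weight decay.
target_norm = 1.0                                          # hypothetical target norm
W *= target_norm / np.linalg.norm(W, axis=1, keepdims=True)
```

The rescaling above acts per output row purely for illustration; the paper's actual WRS schedule and the exact norm being reset are design choices described in the full text.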
Related papers
- Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks [9.948870430491738]
We study the implicit bias towards low-rank weight matrices when training neural networks with Weight Decay (WD).
Our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.
arXiv Detail & Related papers (2024-10-03T03:36:18Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs can suffer from training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized
Deep Neural Networks [25.114642281756495]
Weight decay is one of the most widely used forms of regularization in deep learning.
This paper argues that gradient descent may be an inefficient algorithm for this objective.
For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective.
arXiv Detail & Related papers (2022-10-06T17:22:40Z) - BiTAT: Neural Network Binarization with Task-dependent Aggregated
Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weight/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z) - Self-Adaptive Physics-Informed Neural Networks using a Soft Attention Mechanism [1.6114012813668932]
Physics-Informed Neural Networks (PINNs) have emerged as a promising application of deep neural networks to the numerical solution of nonlinear partial differential equations (PDEs).
We propose a fundamentally new way to train PINNs adaptively, where the adaptation weights are fully trainable and applied to each training point individually.
In numerical experiments with several linear and nonlinear benchmark problems, the SA-PINN outperformed other state-of-the-art PINN algorithms in L2 error.
arXiv Detail & Related papers (2020-09-07T04:07:52Z) - Improve Generalization and Robustness of Neural Networks via Weight
Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)