FixNorm: Dissecting Weight Decay for Training Deep Neural Networks
- URL: http://arxiv.org/abs/2103.15345v1
- Date: Mon, 29 Mar 2021 05:41:56 GMT
- Title: FixNorm: Dissecting Weight Decay for Training Deep Neural Networks
- Authors: Yucong Zhou, Yunxiao Sun, Zhao Zhong
- Abstract summary: We propose a new training method called FixNorm, which discards weight decay and directly controls the two mechanisms.
On the ImageNet classification task, training EfficientNet-B0 with FixNorm achieves 77.7% accuracy, outperforming the original baseline by a clear margin.
- Score: 7.820667552233989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weight decay is a widely used technique for training Deep Neural
Networks (DNNs). It greatly affects generalization performance, but the
underlying mechanisms are not fully understood. Recent works show that for
layers followed by normalization, weight decay mainly affects the effective
learning rate. However, although normalization has been extensively adopted in
modern DNNs, layers such as the final fully-connected layer do not satisfy this
precondition. For these layers, the effects of weight decay are still unclear.
In this paper, we comprehensively investigate the mechanisms of weight decay
and find that, besides influencing the effective learning rate, weight decay has
another, equally important mechanism: it affects generalization performance by
controlling the cross-boundary risk. Together, these two mechanisms give a more
comprehensive explanation of the effects of weight decay. Based on
this discovery, we propose a new training method called FixNorm, which discards
weight decay and directly controls the two mechanisms. We also propose a simple
yet effective method for tuning the hyperparameters of FixNorm, which can find
near-optimal solutions in a few trials. On the ImageNet classification task,
training EfficientNet-B0 with FixNorm achieves 77.7% accuracy, outperforming the
original baseline by a clear margin. Surprisingly, when scaling MobileNetV2 to
the same FLOPs and applying the same tricks as EfficientNet-B0, training with
FixNorm achieves 77.4%, only 0.3% lower. A series of SOTA results shows the
importance of well-tuned training procedures and further verifies the
effectiveness of our approach. We set up more well-tuned baselines using FixNorm
to facilitate fair comparisons in the community.
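To make the effective-learning-rate mechanism concrete, here is a minimal sketch (not the authors' code; FixNorm's exact update rule may differ) of the idea of discarding weight decay and controlling the weight norm directly: after each optimizer step, the weights of layers followed by normalization are rescaled to a fixed L2 norm, which pins the effective learning rate (roughly lr / ||w||^2) that weight decay would otherwise modulate. The helper name and target norm below are illustrative assumptions.

```python
import torch

def fixed_norm_projection(weights, target_norm=1.0):
    """Illustrative helper: rescale each weight tensor to a fixed L2 norm."""
    with torch.no_grad():
        for w in weights:
            n = w.norm()
            if n > 0:
                w.mul_(target_norm / n)

# Usage inside a standard training loop, with weight decay switched off:
#   optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.0)
#   loss.backward()
#   optimizer.step()
#   fixed_norm_projection([m.weight for m in model.modules()
#                          if isinstance(m, torch.nn.Conv2d)])
```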
Related papers
- Achieving Constraints in Neural Networks: A Stochastic Augmented Lagrangian Approach [49.1574468325115]
Regularizing Deep Neural Networks (DNNs) is essential for improving generalizability and preventing overfitting.
We propose a novel approach to DNN regularization by framing the training process as a constrained optimization problem.
We employ the Stochastic Augmented Lagrangian (SAL) method to achieve a more flexible and efficient regularization mechanism.
arXiv Detail & Related papers (2023-10-25T13:55:35Z)
- FedNAR: Federated Optimization with Normalized Annealing Regularization [54.42032094044368]
We explore the choice of weight decay and identify that its value appreciably influences the convergence of existing FL algorithms.
We develop Federated optimization with Normalized Annealing Regularization (FedNAR), a plug-in that can be seamlessly integrated into any existing FL algorithm.
arXiv Detail & Related papers (2023-10-04T21:11:40Z)
- PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks [25.114642281756495]
Weight decay is one of the most widely used forms of regularization in deep learning.
This paper argues that gradient descent may be an inefficient algorithm for the weight decay regularized objective.
For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective.
arXiv Detail & Related papers (2022-10-06T17:22:40Z)
- BiTAT: Neural Network Binarization with Task-dependent Aggregated Transformation [116.26521375592759]
Quantization aims to transform high-precision weights and activations of a given neural network into low-precision weights/activations for reduced memory usage and computation.
Extreme quantization (1-bit weights/1-bit activations) of compactly-designed backbone architectures results in severe performance degeneration.
This paper proposes a novel Quantization-Aware Training (QAT) method that can effectively alleviate performance degeneration.
arXiv Detail & Related papers (2022-07-04T13:25:49Z)
- SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network [8.79431718760617]
Training with mini-batch SGD and weight decay induces a bias toward rank minimization in weight matrices.
We show that this bias becomes more pronounced with smaller batch sizes, higher learning rates, or stronger weight decay.
We empirically explore the connection between this bias and generalization, finding that it has a marginal effect on the test performance.
arXiv Detail & Related papers (2022-06-12T17:06:35Z)
- Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight-parameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but have a weight distribution with markedly higher density at zero, allowing more parameters to be pruned safely.
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z)
- The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization [44.30960913470372]
Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling due to their normalization operations; a small numerical check of this invariance is sketched after this list.
We investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay.
arXiv Detail & Related papers (2021-02-06T03:40:20Z)
- MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization [60.36100335878855]
We propose a novel normalization method, named MimicNorm, to improve the convergence and efficiency in network training.
We leverage neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and drives the network into the chaotic regime, as a BN layer does.
MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks like ShuffleNet, with a reduction of about 20% in memory consumption.
arXiv Detail & Related papers (2020-10-19T07:42:41Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
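The weight-rescaling invariance mentioned in the batch-normalization entry above is easy to check numerically. The snippet below is a small sketch (not taken from any of the listed papers): a linear layer followed by batch normalization computes the same function after its weights are rescaled, so weight decay on such a layer changes only the weight norm, and hence the effective learning rate, rather than the function the layer represents.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(32, 16)   # a batch of 32 inputs with 16 features
w = torch.randn(8, 16)    # weights of a linear layer with 8 outputs

def linear_then_bn(x, w):
    # Linear layer followed by batch normalization (no affine parameters),
    # normalizing with batch statistics as during training.
    y = F.linear(x, w)
    return F.batch_norm(y, running_mean=None, running_var=None, training=True)

out_original = linear_then_bn(x, w)
out_rescaled = linear_then_bn(x, 10.0 * w)  # rescale the weights by any factor
print(torch.allclose(out_original, out_rescaled, atol=1e-4))  # True: output unchanged
```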