Stochastic Weight Averaging Revisited
- URL: http://arxiv.org/abs/2201.00519v1
- Date: Mon, 3 Jan 2022 08:29:01 GMT
- Title: Stochastic Weight Averaging Revisited
- Authors: Hao Guo, Jiyong Jin, Bin Liu
- Abstract summary: We show that SWA's performance is highly dependent on the extent to which the SGD process that runs before SWA has converged.
We show that, following an SGD process with insufficient convergence, running SWA repeatedly leads to continued incremental gains in generalization.
- Score: 5.68481425260348
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Stochastic weight averaging (SWA) is recognized as a simple yet
effective approach for improving the generalization of stochastic gradient descent
(SGD) when training deep neural networks (DNNs). A common insight to explain its
success is that averaging weights following an SGD process equipped with
cyclical or high constant learning rates can discover wider optima, which then
lead to better generalization. We give a new insight that does not concur with
the above one. We characterize how SWA's performance depends on the extent to
which the SGD process that runs before SWA has converged, and show that the
weight averaging operation contributes only to variance reduction. This new insight
suggests practical guidelines for better algorithm design. As an instantiation, we
show that, following an SGD process with insufficient convergence, running SWA
repeatedly leads to continued incremental benefits in generalization.
Our findings are corroborated by extensive experiments across different network
architectures, including a baseline CNN, PreResNet-164, WideResNet-28-10,
VGG16, ResNet-50, ResNet-152, DenseNet-161, and different datasets including
CIFAR-{10,100} and ImageNet.
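For readers unfamiliar with the procedure, the following is a minimal sketch of weight averaging along the tail of an SGD run; the toy model, data, and hyperparameters are illustrative assumptions, not the authors' experimental setup.

    # Minimal SWA sketch (illustrative; not the authors' code). Phase 1 runs
    # plain SGD; phase 2 keeps training and maintains a running average of the
    # weights, which -- per the abstract above -- mainly reduces variance.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(512, 20)                 # toy inputs (assumed)
    y = torch.randint(0, 2, (512,))          # toy labels (assumed)

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    def train_epochs(n):
        for _ in range(n):
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()

    # Phase 1: the SGD run that precedes SWA. The paper's point is that how far
    # this phase converges largely determines what SWA can add.
    train_epochs(200)

    # Phase 2: SWA tail with a constant learning rate; keep a running mean of
    # the weights visited by SGD.
    swa_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    n_avg = 1
    for _ in range(50):
        train_epochs(1)
        n_avg += 1
        for k, v in model.state_dict().items():
            swa_state[k] += (v.detach() - swa_state[k]) / n_avg   # incremental mean

    model.load_state_dict(swa_state)          # single averaged ("SWA") model

One natural reading of "running SWA repeatedly" in the abstract is to restart phase 2 from the averaged weights and average again, though the exact protocol used in the paper may differ.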
Related papers
- Improving Generalization and Convergence by Enhancing Implicit Regularization [15.806923167905026]
An Implicit Regularization Enhancement (IRE) framework is proposed to accelerate the discovery of flat solutions in deep learning.
IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions.
We show that IRE can be practically incorporated with generic base optimizers without introducing significant computational overhead.
arXiv Detail & Related papers (2024-05-31T12:32:34Z) - Hierarchical Weight Averaging for Deep Neural Networks [39.45493779043969]
Stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).
Weight averaging (WA) which averages the weights of multiple models has recently received much attention in the literature.
In this work, we make a first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).
arXiv Detail & Related papers (2023-04-23T02:58:03Z) - Unifying Synergies between Self-supervised Learning and Dynamic
Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Differentially private training of residual networks with scale
normalisation [64.60453677988517]
We investigate the optimal choice of replacement layer for Batch Normalisation (BN) in residual networks (ResNets).
We study the phenomenon of scale mixing in residual blocks, whereby the activations on the two branches are scaled differently.
arXiv Detail & Related papers (2022-03-01T09:56:55Z) - Multiplicative Reweighting for Robust Neural Network Optimization [51.67267839555836]
Multiplicative weight (MW) updates are robust to moderate data corruptions in expert advice.
We show that MW improves the accuracy of neural networks in the presence of label noise.
arXiv Detail & Related papers (2021-02-24T10:40:25Z) - The Implicit Biases of Stochastic Gradient Descent on Deep Neural
Networks with Batch Normalization [44.30960913470372]
Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling due to their normalization operations.
We investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical explanation for the efficacy of weight decay.
arXiv Detail & Related papers (2021-02-06T03:40:20Z) - Neural networks with late-phase weights [66.72777753269658]
We show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning.
At the end of learning, we obtain back a single model by taking a spatial average in weight space.
arXiv Detail & Related papers (2020-07-25T13:23:37Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on
Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Breaking (Global) Barriers in Parallel Stochastic Optimization with
Wait-Avoiding Group Averaging [34.55741812648229]
We present WAGMA-SGD, a wait-avoiding group model averaging variant of SGD that reduces global communication via subgroup weight exchange.
We train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z) - Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent [32.40217829362088]
We propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training deep neural networks (DNNs).
SRSGD replaces the constant momentum in SGD with the increasing momentum of NAG, but stabilizes the iterations by resetting the momentum to zero according to a schedule (a minimal sketch follows after this list).
On both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.
arXiv Detail & Related papers (2020-02-24T23:16:19Z)
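As a concrete illustration of the scheduled restart idea summarized in the last entry above, here is a minimal sketch; the toy quadratic objective, learning rate, and restart period are illustrative assumptions, and the exact iteration in the SRSGD paper may differ in its details.

    # Minimal sketch of scheduled restart momentum (illustrative assumptions):
    # a NAG-style iteration whose increasing momentum mu_k = k / (k + 3) is
    # reset to zero every `restart` steps.
    import numpy as np

    def nag_with_restarts(grad, w0, lr=0.1, restart=40, steps=200):
        w = w0.copy()
        v_prev = w0.copy()
        for t in range(steps):
            k = t % restart
            mu = k / (k + 3.0)          # increasing momentum, 0 right after a restart
            v = w - lr * grad(w)        # gradient step
            w = v + mu * (v - v_prev)   # Nesterov-style extrapolation
            v_prev = v
        return w

    # Usage on a toy quadratic 0.5 * ||A w - b||^2 (assumed example problem).
    A = np.array([[3.0, 0.2], [0.2, 1.0]])
    b = np.array([1.0, -2.0])
    w_min = nag_with_restarts(lambda w: A.T @ (A @ w - b), w0=np.zeros(2))
    print(w_min)                        # approaches the least-squares solution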