Hierarchical Weight Averaging for Deep Neural Networks
- URL: http://arxiv.org/abs/2304.11519v1
- Date: Sun, 23 Apr 2023 02:58:03 GMT
- Title: Hierarchical Weight Averaging for Deep Neural Networks
- Authors: Xiaozhe Gu, Zixun Zhang, Yuncheng Jiang, Tao Luo, Ruimao Zhang,
Shuguang Cui, Zhen Li
- Abstract summary: Stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).
Weight averaging (WA), which averages the weights of multiple models, has recently received much attention in the literature.
In this work, we make a first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).
- Score: 39.45493779043969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their simplicity, stochastic gradient descent (SGD)-like algorithms are
successful in training deep neural networks (DNNs). Among various attempts to
improve SGD, weight averaging (WA), which averages the weights of multiple
models, has recently received much attention in the literature. Broadly, WA
falls into two categories: 1) online WA, which averages the weights of multiple
models trained in parallel, is designed for reducing the gradient communication
overhead of parallel mini-batch SGD, and 2) offline WA, which averages the
weights of one model at different checkpoints, is typically used to improve the
generalization ability of DNNs. Though online and offline WA are similar in
form, they are seldom associated with each other. Besides, these methods
typically perform either offline parameter averaging or online parameter
averaging, but not both. In this work, we make a first attempt to incorporate online
and offline WA into a general training framework termed Hierarchical Weight
Averaging (HWA). By leveraging both the online and offline averaging manners,
HWA is able to achieve both faster convergence speed and superior
generalization performance without any fancy learning rate adjustment. Besides,
we also empirically analyze the issues faced by existing WA methods and how HWA
addresses them. Finally, extensive experiments verify that HWA
outperforms the state-of-the-art methods significantly.
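The two kinds of averaging described in the abstract can be illustrated with a small toy sketch. This is not the authors' HWA algorithm, whose details are not given here; it only shows online averaging (several models trained in parallel are periodically synchronized to their mean) followed by offline averaging (the synchronized checkpoints are averaged afterwards). The quadratic loss, the noisy gradient, and all names and parameters (`average_weights`, `noisy_grad`, `train`, `num_workers`, `sync_every`, `num_cycles`) are assumptions made for illustration.

```python
import random


def average_weights(weight_sets):
    """Element-wise mean of several equally shaped weight vectors."""
    n = len(weight_sets)
    return [sum(ws) / n for ws in zip(*weight_sets)]


def noisy_grad(w, target, rng, noise=0.1):
    """Gradient of 0.5 * ||w - target||^2 plus Gaussian noise.

    A toy stand-in for a mini-batch gradient (assumption, not from the paper).
    """
    return [wi - ti + rng.gauss(0.0, noise) for wi, ti in zip(w, target)]


def train(target, num_workers=4, sync_every=5, num_cycles=20, lr=0.1, seed=0):
    """Return (last synchronized weights, offline average of late checkpoints)."""
    rng = random.Random(seed)
    workers = [[0.0] * len(target) for _ in range(num_workers)]
    checkpoints = []
    synced = workers[0]

    for _ in range(num_cycles):
        # Each worker takes `sync_every` local SGD steps on its own copy.
        for k in range(num_workers):
            w = workers[k]
            for _ in range(sync_every):
                g = noisy_grad(w, target, rng)
                w = [wi - lr * gi for wi, gi in zip(w, g)]
            workers[k] = w

        # Online averaging: all workers restart from the mean of their weights.
        synced = average_weights(workers)
        workers = [list(synced) for _ in range(num_workers)]
        checkpoints.append(synced)

    # Offline averaging: mean of the checkpoints from the second half of training.
    tail = checkpoints[len(checkpoints) // 2:]
    return synced, average_weights(tail)


if __name__ == "__main__":
    last, averaged = train(target=[1.0, -2.0])
    print("last synchronized weights:", last)
    print("offline-averaged weights: ", averaged)
```

In this sketch the online step runs inside the training loop and changes the trajectory, while the offline step only post-processes stored checkpoints, which is the distinction the abstract draws between the two categories.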
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to training large models.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z)
- Lookaround Optimizer: $k$ steps around, 1 step average [36.207388029666625]
Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization.
Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner.
We propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization.
arXiv Detail & Related papers (2023-06-13T10:55:20Z)
- Diverse Weight Averaging for Out-of-Distribution Generalization [100.22155775568761]
We propose Diverse Weight Averaging (DiWA) to average weights obtained from several independent training runs rather than from a single run.
DiWA consistently improves the state of the art on the competitive DomainBed benchmark without inference overhead.
arXiv Detail & Related papers (2022-05-19T17:44:22Z)
- Stochastic Weight Averaging Revisited [5.68481425260348]
We show that SWA's performance depends heavily on how far the SGD process that runs before SWA has converged.
We show that following an SGD process with insufficient convergence, running SWA more times leads to continual incremental benefits in terms of generalization.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- Dynamic Slimmable Network [105.74546828182834]
We develop a dynamic network slimming regime named Dynamic Slimmable Network (DS-Net).
Our DS-Net is empowered with the ability of dynamic inference by the proposed double-headed dynamic gate.
It consistently outperforms its static counterparts as well as state-of-the-art static and dynamic model compression methods.
arXiv Detail & Related papers (2021-03-24T15:25:20Z)
- MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [153.56211546576978]
In this work, we propose that better soft targets with higher compatibility can be generated by using a label generator.
We can employ the meta-learning technique to optimize this label generator.
The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012.
arXiv Detail & Related papers (2020-08-27T13:04:27Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
- Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [34.55741812648229]
We present WAGMA-SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange.
We train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale.
Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.