Hierarchical Weight Averaging for Deep Neural Networks
- URL: http://arxiv.org/abs/2304.11519v1
- Date: Sun, 23 Apr 2023 02:58:03 GMT
- Title: Hierarchical Weight Averaging for Deep Neural Networks
- Authors: Xiaozhe Gu, Zixun Zhang, Yuncheng Jiang, Tao Luo, Ruimao Zhang,
Shuguang Cui, Zhen Li
- Abstract summary: Stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).
Weight averaging (WA) which averages the weights of multiple models has recently received much attention in the literature.
In this work, we make the first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).
- Score: 39.45493779043969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their simplicity, stochastic gradient descent (SGD)-like algorithms are
successful in training deep neural networks (DNNs). Among various attempts to
improve SGD, weight averaging (WA), which averages the weights of multiple
models, has recently received much attention in the literature. Broadly, WA
falls into two categories: 1) online WA, which averages the weights of multiple
models trained in parallel, is designed for reducing the gradient communication
overhead of parallel mini-batch SGD, and 2) offline WA, which averages the
weights of one model at different checkpoints, is typically used to improve the
generalization ability of DNNs. Though online WA and offline WA are similar in form, they are seldom associated with each other, and existing methods typically perform either online or offline parameter averaging, but not both. In this work, we make the first attempt to incorporate online
and offline WA into a general training framework termed Hierarchical Weight
Averaging (HWA). By leveraging both the online and offline averaging manners,
HWA is able to achieve both faster convergence speed and superior
generalization performance without any fancy learning rate adjustment. We also empirically analyze the issues faced by existing WA methods and show how HWA addresses them. Finally, extensive experiments verify that HWA significantly outperforms state-of-the-art methods.
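Based only on the abstract above, the following is a minimal PyTorch-style sketch of how online and offline WA could be combined hierarchically: several workers train in parallel and are averaged online every few steps, and each synchronized model is also folded into an offline running average across cycles. The model, hyper-parameters, and synchronization period are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of hierarchical weight averaging, assuming:
#  - online WA  = periodically averaging several workers trained in parallel,
#  - offline WA = a running average of the synchronized checkpoints.
import copy
import torch
import torch.nn as nn

def average_state_dicts(state_dicts):
    """Element-wise average of a list of state dicts (used for the online sync)."""
    avg = copy.deepcopy(state_dicts[0])
    for key, ref in avg.items():
        if ref.is_floating_point():
            avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(0)
    return avg

def make_model():
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

num_workers, sync_every, total_steps = 4, 10, 100
init = make_model()
workers = [copy.deepcopy(init) for _ in range(num_workers)]  # shared initialization
optims = [torch.optim.SGD(m.parameters(), lr=0.1) for m in workers]
offline_avg, num_cycles = None, 0

for step in range(1, total_steps + 1):
    for model, opt in zip(workers, optims):  # each worker sees its own mini-batch
        x, y = torch.randn(16, 10), torch.randn(16, 1)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    if step % sync_every == 0:
        # Online WA: average the parallel workers and restart them from the average.
        synced = average_state_dicts([m.state_dict() for m in workers])
        for model in workers:
            model.load_state_dict(synced)
        # Offline WA: fold each synchronized checkpoint into a running average.
        num_cycles += 1
        if offline_avg is None:
            offline_avg = copy.deepcopy(synced)
        else:
            for key, ref in offline_avg.items():
                if ref.is_floating_point():
                    ref += (synced[key] - ref) / num_cycles

final_model = make_model()
final_model.load_state_dict(offline_avg)  # the offline-averaged model is the one evaluated
```

The key property carried over from the abstract is that inference uses a single (offline-averaged) model, so the hierarchy adds no test-time cost.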
Related papers
- Lookaround Optimizer: $k$ steps around, 1 step average [36.207388029666625]
Weight averaging (WA) is an active research topic due to its simplicity in ensembling deep networks and its effectiveness in promoting generalization.
Existing weight averaging approaches, however, are often carried out along only one training trajectory in a post-hoc manner.
We propose Lookaround, a straightforward yet effective SGD-based optimizer that leads to flatter minima with better generalization.
arXiv Detail & Related papers (2023-06-13T10:55:20Z)
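Going only by the title and summary above ("$k$ steps around, 1 step average"), a hedged sketch of one such round might look as follows: copies of a shared network each take k SGD steps (for example on differently augmented data streams), and then all copies are reset to their weight average. The function name, arguments, and training details are assumptions rather than the paper's exact recipe.

```python
# Hypothetical sketch of a "k steps around, 1 step average" round.
import copy
import torch
import torch.nn as nn

def lookaround_round(shared, streams, k=5, lr=0.1):
    """Branch the shared model, train each copy for k steps, then average the copies."""
    copies = [copy.deepcopy(shared) for _ in streams]
    for model, batches in zip(copies, streams):      # one copy per data/augmentation stream
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for x, y in batches[:k]:                     # "k steps around"
            loss = nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    avg = copy.deepcopy(copies[0].state_dict())      # "1 step average"
    for key, ref in avg.items():
        if ref.is_floating_point():
            avg[key] = torch.stack([c.state_dict()[key] for c in copies]).mean(0)
    shared.load_state_dict(avg)                      # all branches restart from the average
    return shared
```

Here `streams` would be a list of iterables of (inputs, labels) batches, one per augmentation; calling the function repeatedly alternates the "around" and "average" phases over training.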
- Deep Combinatorial Aggregation [58.78692706974121]
Deep ensemble is a simple and effective method that achieves state-of-the-art results for uncertainty-aware learning tasks.
In this work, we explore a generalization of deep ensemble called deep combinatorial aggregation (DCA).
DCA creates multiple instances of network components and aggregates their combinations to produce diversified model proposals and predictions.
arXiv Detail & Related papers (2022-10-12T17:35:03Z)
- Diverse Weight Averaging for Out-of-Distribution Generalization [100.22155775568761]
We propose Diverse Weight Averaging (DiWA) to average weights obtained from several independent training runs rather than from a single run.
DiWA consistently improves the state of the art on the competitive DomainBed benchmark without inference overhead.
arXiv Detail & Related papers (2022-05-19T17:44:22Z)
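A minimal sketch of the averaging step as summarized above: the weights from several independently trained runs are averaged uniformly into one model, so inference costs the same as a single network. Checkpoint handling and the uniform weighting are assumptions about the setup, not DiWA's full recipe.

```python
# Hedged sketch: uniformly average checkpoints saved by independent training runs.
import torch

def average_checkpoints(checkpoint_paths):
    """Return a state dict that is the uniform average of the given checkpoints."""
    state_dicts = [torch.load(path, map_location="cpu") for path in checkpoint_paths]
    averaged = {}
    for key, ref in state_dicts[0].items():
        if ref.is_floating_point():
            averaged[key] = torch.stack([sd[key] for sd in state_dicts]).mean(0)
        else:
            averaged[key] = ref.clone()  # e.g. integer buffers: keep a single copy
    return averaged

# Usage (paths are illustrative): model.load_state_dict(average_checkpoints([...]))
```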
- Stochastic Weight Averaging Revisited [5.68481425260348]
We show that SWA's performance depends heavily on how well the SGD process that precedes it has converged.
We also show that, when the preceding SGD process has not converged sufficiently, running SWA repeatedly yields continual incremental gains in generalization.
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- Dynamic Slimmable Network [105.74546828182834]
We develop a dynamic network slimming regime named Dynamic Slimmable Network (DS-Net).
Our DS-Net is empowered with the ability of dynamic inference by the proposed double-headed dynamic gate.
It consistently outperforms its static counterparts as well as state-of-the-art static and dynamic model compression methods.
arXiv Detail & Related papers (2021-03-24T15:25:20Z)
- MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [153.56211546576978]
In this work, we propose that better soft targets with higher compatibility can be generated by using a label generator.
We employ the meta-learning technique to optimize this label generator.
The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012.
arXiv Detail & Related papers (2020-08-27T13:04:27Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
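The slowdown described in the AdamP summary above can be reproduced in a few lines. This is a hedged numeric illustration of the phenomenon, not the AdamP update itself: when a weight only matters up to its norm (as with a weight feeding a normalization layer), the gradient is orthogonal to the weight, so SGD with momentum steadily inflates the weight norm and the effective (angular) step size decays.

```python
# Toy scale-invariant objective: the loss depends only on the direction of w,
# mimicking a weight that is followed by a normalization layer.
import torch

torch.manual_seed(0)
w = torch.randn(8, requires_grad=True)
velocity = torch.zeros_like(w)
lr, momentum = 0.1, 0.9

for step in range(1, 301):
    x, y = torch.randn(8), torch.tensor(1.0)
    w_hat = w / w.norm()                      # normalization makes the loss scale-invariant
    loss = 0.5 * (w_hat @ x - y) ** 2
    loss.backward()
    with torch.no_grad():
        velocity = momentum * velocity + w.grad
        w -= lr * velocity                    # plain SGD with heavy-ball momentum
        if step % 100 == 0:
            # ||w|| keeps growing while ||grad|| shrinks like 1/||w||, so the
            # effective step on the unit sphere (lr * ||grad|| / ||w||) decays.
            print(step, w.norm().item(), w.grad.norm().item())
        w.grad.zero_()
```

AdamP counteracts this by projecting out the radial component of the update for such scale-invariant weights, which keeps the effective step size from decaying prematurely.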
- Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging [34.55741812648229]
We present WAGMA-SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange.
We train ResNet-50 on ImageNet, a Transformer for machine translation, and deep reinforcement learning agents for navigation at scale.
Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput.
arXiv Detail & Related papers (2020-04-30T22:11:53Z)
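A hedged, communication-free sketch of the group-averaging idea summarized above (wait-avoidance and the actual collectives are not modeled): each synchronization averages weights only within small, rotating subgroups of workers instead of over all of them, so global communication shrinks while information still mixes across workers over successive steps. Function and argument names are illustrative.

```python
# Hypothetical sketch: average model weights within rotating subgroups of workers.
import torch

def group_average(worker_states, group_size, shift):
    """Average the given state dicts within subgroups; `shift` rotates the grouping."""
    n = len(worker_states)
    order = [(i + shift) % n for i in range(n)]          # rotate membership every call
    for start in range(0, n, group_size):
        group = [worker_states[order[i]] for i in range(start, min(start + group_size, n))]
        for key in group[0]:
            if not group[0][key].is_floating_point():
                continue
            avg = torch.stack([sd[key] for sd in group]).mean(0)
            for sd in group:
                sd[key].copy_(avg)                       # each member restarts from the group average

# Usage sketch: state_dict() tensors share storage with the live parameters, so the
# in-place copy above updates the models directly.
#   group_average([m.state_dict() for m in models], group_size=4, shift=step)
```

In a real distributed run the state dicts would live on different ranks and the averaging would be an asynchronous subgroup collective; here plain Python lists stand in for the workers.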