Lookaround Optimizer: $k$ steps around, 1 step average
- URL: http://arxiv.org/abs/2306.07684v3
- Date: Thu, 2 Nov 2023 15:24:29 GMT
- Title: Lookaround Optimizer: $k$ steps around, 1 step average
- Authors: Jiangtao Zhang, Shunyu Liu, Jie Song, Tongtian Zhu, Zhengqi Xu, Mingli Song
- Abstract summary: Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization.
Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner.
We propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization.
- Score: 36.207388029666625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weight Average (WA) is an active research topic due to its simplicity in
ensembling deep networks and the effectiveness in promoting generalization.
Existing weight average approaches, however, are often carried out along only
one training trajectory in a post-hoc manner (i.e., the weights are averaged
after the entire training process is finished), which significantly degrades
the diversity between networks and thus impairs the effectiveness. In this
paper, inspired by weight average, we propose Lookaround, a straightforward yet
effective SGD-based optimizer leading to flatter minima with better
generalization. Specifically, Lookaround iterates two steps during the whole
training period: the around step and the average step. In each iteration, 1)
the around step starts from a common point and trains multiple networks
simultaneously, each on transformed data by a different data augmentation, and
2) the average step averages these trained networks to get the averaged
network, which serves as the starting point for the next iteration. The around
step improves the functionality diversity while the average step guarantees the
weight locality of these networks during the whole training, which is essential
for WA to work. We theoretically explain the superiority of Lookaround by
convergence analysis, and conduct extensive experiments to evaluate Lookaround on
popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs,
demonstrating clear superiority over state-of-the-art methods. Our code is available
at https://github.com/Ardcy/Lookaround.
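To make the around/average loop above concrete, below is a minimal PyTorch-style sketch of one Lookaround iteration: k "around" SGD steps on several copies of the network, each trained under a different data augmentation from a common starting point, followed by one weight-averaging step. The toy model, the flip/noise augmentations, the choice to share minibatches across copies, and all hyperparameters are illustrative assumptions; this is not the authors' released implementation, which lives in the repository linked above.

```python
# Illustrative sketch of one Lookaround iteration (k steps around, 1 step average).
# Assumptions: a toy classifier, simple augmentations, plain SGD. The authors'
# implementation is at https://github.com/Ardcy/Lookaround.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def augmentations():
    # Placeholder augmentations standing in for the paper's image augmentations.
    return [
        lambda x: x,                               # identity
        lambda x: torch.flip(x, dims=[3]),         # horizontal flip
        lambda x: x + 0.05 * torch.randn_like(x),  # additive noise
    ]


def lookaround_iteration(model, batch_iter, k=5, lr=0.05):
    """One Lookaround iteration: k 'around' SGD steps per copy, then 1 average step."""
    augs = augmentations()
    # Around step: every copy starts from the same (averaged) weights; here each
    # copy sees the same minibatches under a different augmentation (an assumption
    # of this sketch).
    copies = [copy.deepcopy(model) for _ in augs]
    opts = [torch.optim.SGD(c.parameters(), lr=lr) for c in copies]
    for _ in range(k):
        x, y = next(batch_iter)
        for net, opt, aug in zip(copies, opts, augs):
            loss = F.cross_entropy(net(aug(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Average step: the element-wise mean of the trained copies becomes the common
    # starting point for the next iteration, keeping the copies weight-local.
    # (Buffers such as BatchNorm running statistics are not handled here.)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(torch.stack(
                [dict(c.named_parameters())[name] for c in copies]).mean(0))
    return model


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy CIFAR-sized classifier

    def random_batches():
        while True:
            yield torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))

    batches = random_batches()
    for _ in range(3):
        lookaround_iteration(model, batches, k=5, lr=0.05)
    print("finished 3 Lookaround iterations")
```

In a full training run this iteration would simply be repeated over the dataset, with the usual learning-rate schedule applied to the inner SGD steps.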
Related papers
- Efficient Stagewise Pretraining via Progressive Subnetworks [53.00045381931778]
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective when compared to stacking-based approaches.
This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive, if not better, than stacking methods.
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork at each step, progressively increasing the size in stages.
arXiv Detail & Related papers (2024-02-08T18:49:09Z)
- Hierarchical Weight Averaging for Deep Neural Networks [39.45493779043969]
Stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).
Weight averaging (WA) which averages the weights of multiple models has recently received much attention in the literature.
In this work, we first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).
arXiv Detail & Related papers (2023-04-23T02:58:03Z)
- PA&DA: Jointly Sampling PAth and DAta for Consistent NAS [8.737995937682271]
One-shot NAS methods train a supernet and then inherit the pre-trained weights to evaluate sub-models.
Large gradient variance occurs during supernet training, which degrades the supernet ranking consistency.
We propose to explicitly minimize the gradient variance of the supernet training by jointly optimizing the sampling distributions of PAth and DAta.
arXiv Detail & Related papers (2023-02-28T17:14:24Z)
- Co-training $2^L$ Submodels for Visual Recognition [67.02999567435626]
Submodel co-training is a regularization method related to co-training, self-distillation and stochastic depth.
We show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation.
arXiv Detail & Related papers (2022-12-09T14:38:09Z)
- Learning to Weight Samples for Dynamic Early-exiting Networks [35.03752825893429]
Early exiting is an effective paradigm for improving the inference efficiency of deep networks.
Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit.
We show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.
arXiv Detail & Related papers (2022-09-17T10:46:32Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search [60.965024145243596]
One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance.
To alleviate this problem, we present a simple yet effective architecture distillation method.
We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training.
Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop.
arXiv Detail & Related papers (2020-10-29T17:55:05Z)
- Training Sparse Neural Networks using Compressed Sensing [13.84396596420605]
We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step.
Specifically, we utilize an adaptively weighted $\ell_1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks.
arXiv Detail & Related papers (2020-08-21T19:35:54Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)