Lookaround Optimizer: $k$ steps around, 1 step average
- URL: http://arxiv.org/abs/2306.07684v3
- Date: Thu, 2 Nov 2023 15:24:29 GMT
- Title: Lookaround Optimizer: $k$ steps around, 1 step average
- Authors: Jiangtao Zhang, Shunyu Liu, Jie Song, Tongtian Zhu, Zhengqi Xu, Mingli Song
- Abstract summary: Weight Average (WA) is an active research topic due to its simplicity in ensembling deep networks and the effectiveness in promoting generalization.
Existing weight average approaches, however, are often carried out along only one training trajectory in a post-hoc manner.
We propose Lookaround, a straightforward yet effective SGD-based optimizer leading to flatter minima with better generalization.
- Score: 36.207388029666625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weight Average (WA) is an active research topic due to its simplicity in
ensembling deep networks and the effectiveness in promoting generalization.
Existing weight average approaches, however, are often carried out along only
one training trajectory in a post-hoc manner (i.e., the weights are averaged
after the entire training process is finished), which significantly degrades
the diversity between networks and thus impairs the effectiveness. In this
paper, inspired by weight average, we propose Lookaround, a straightforward yet
effective SGD-based optimizer leading to flatter minima with better
generalization. Specifically, Lookaround iterates two steps during the whole
training period: the around step and the average step. In each iteration, 1)
the around step starts from a common point and trains multiple networks
simultaneously, each on transformed data by a different data augmentation, and
2) the average step averages these trained networks to get the averaged
network, which serves as the starting point for the next iteration. The around
step improves the functionality diversity while the average step guarantees the
weight locality of these networks during the whole training, which is essential
for WA to work. We theoretically explain the superiority of Lookaround by
convergence analysis, and conduct extensive experiments to evaluate Lookaround on
popular benchmarks including CIFAR and ImageNet with both CNNs and ViTs,
demonstrating clear superiority over state-of-the-art methods. Our code is available
at https://github.com/Ardcy/Lookaround.
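To make the around/average loop above concrete, below is a minimal PyTorch-style sketch of one Lookaround iteration: k "around" SGD steps on several copies of the network, each trained under a different data augmentation from a common starting point, followed by one weight-averaging step. The toy model, the flip/noise augmentations, the choice to share minibatches across copies, and all hyperparameters are illustrative assumptions; this is not the authors' released implementation, which lives in the repository linked above.

```python
# Illustrative sketch of one Lookaround iteration (k steps around, 1 step average).
# Assumptions: a toy classifier, simple augmentations, plain SGD. The authors'
# implementation is at https://github.com/Ardcy/Lookaround.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def augmentations():
    # Placeholder augmentations standing in for the paper's image augmentations.
    return [
        lambda x: x,                               # identity
        lambda x: torch.flip(x, dims=[3]),         # horizontal flip
        lambda x: x + 0.05 * torch.randn_like(x),  # additive noise
    ]


def lookaround_iteration(model, batch_iter, k=5, lr=0.05):
    """One Lookaround iteration: k 'around' SGD steps per copy, then 1 average step."""
    augs = augmentations()
    # Around step: every copy starts from the same (averaged) weights; here each
    # copy sees the same minibatches under a different augmentation (an assumption
    # of this sketch).
    copies = [copy.deepcopy(model) for _ in augs]
    opts = [torch.optim.SGD(c.parameters(), lr=lr) for c in copies]
    for _ in range(k):
        x, y = next(batch_iter)
        for net, opt, aug in zip(copies, opts, augs):
            loss = F.cross_entropy(net(aug(x)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Average step: the element-wise mean of the trained copies becomes the common
    # starting point for the next iteration, keeping the copies weight-local.
    # (Buffers such as BatchNorm running statistics are not handled here.)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(torch.stack(
                [dict(c.named_parameters())[name] for c in copies]).mean(0))
    return model


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy CIFAR-sized classifier

    def random_batches():
        while True:
            yield torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))

    batches = random_batches()
    for _ in range(3):
        lookaround_iteration(model, batches, k=5, lr=0.05)
    print("finished 3 Lookaround iterations")
```

In a full training run this iteration would simply be repeated over the dataset, with the usual learning-rate schedule applied to the inner SGD steps.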
Related papers
- Efficient Stagewise Pretraining via Progressive Subnetworks [53.00045381931778]
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective when compared to stacking-based approaches.
This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive, if not better, than stacking methods.
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork at each step, progressively increasing the size in stages.
arXiv Detail & Related papers (2024-02-08T18:49:09Z)
- Hierarchical Weight Averaging for Deep Neural Networks [39.45493779043969]
Stochastic gradient descent (SGD)-like algorithms are successful in training deep neural networks (DNNs).
Weight averaging (WA) which averages the weights of multiple models has recently received much attention in the literature.
In this work, we first attempt to incorporate online and offline WA into a general training framework termed Hierarchical Weight Averaging (HWA).
arXiv Detail & Related papers (2023-04-23T02:58:03Z)
- PA&DA: Jointly Sampling PAth and DAta for Consistent NAS [8.737995937682271]
One-shot NAS methods train a supernet and then inherit the pre-trained weights to evaluate sub-models.
Large gradient variance occurs during supernet training, which degrades the supernet ranking consistency.
We propose to explicitly minimize the gradient variance of the supernet training by jointly optimizing the sampling distributions of PAth and DAta.
arXiv Detail & Related papers (2023-02-28T17:14:24Z)
- Co-training $2^L$ Submodels for Visual Recognition [67.02999567435626]
Submodel co-training is a regularization method related to co-training, self-distillation and stochastic depth.
We show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation.
arXiv Detail & Related papers (2022-12-09T14:38:09Z)
- Learning to Weight Samples for Dynamic Early-exiting Networks [35.03752825893429]
Early exiting is an effective paradigm for improving the inference efficiency of deep networks.
Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit.
We show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.
arXiv Detail & Related papers (2022-09-17T10:46:32Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search [60.965024145243596]
One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance.
To alleviate this problem, we present a simple yet effective architecture distillation method.
We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training.
Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop.
arXiv Detail & Related papers (2020-10-29T17:55:05Z)
- Training Sparse Neural Networks using Compressed Sensing [13.84396596420605]
We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step.
Specifically, we utilize an adaptively weighted $\ell_1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks.
arXiv Detail & Related papers (2020-08-21T19:35:54Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)