When Does Re-initialization Work?
- URL: http://arxiv.org/abs/2206.10011v2
- Date: Sun, 2 Apr 2023 22:19:08 GMT
- Title: When Does Re-initialization Work?
- Authors: Sheheryar Zaidi, Tudor Berariu, Hyunjik Kim, Jörg Bornschein, Claudia Clopath, Yee Whye Teh, Razvan Pascanu
- Abstract summary: Re-initialization has been observed to improve generalization in recent works.
Yet it is neither widely adopted in deep learning practice nor often used in state-of-the-art training protocols.
This raises the question of when re-initialization works, and whether it should be used together with regularization techniques.
- Score: 50.70297319284022
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Re-initializing a neural network during training has been observed to improve
generalization in recent works. Yet it is neither widely adopted in deep
learning practice nor is it often used in state-of-the-art training protocols.
This raises the question of when re-initialization works, and whether it should
be used together with regularization techniques such as data augmentation,
weight decay and learning rate schedules. In this work, we conduct an extensive
empirical comparison of standard training with a selection of re-initialization
methods to answer this question, training over 15,000 models on a variety of
image classification benchmarks. We first establish that such methods are
consistently beneficial for generalization in the absence of any other
regularization. However, when deployed alongside other carefully tuned
regularization techniques, re-initialization methods offer little to no added
benefit for generalization, although optimal generalization performance becomes
less sensitive to the choice of learning rate and weight decay hyperparameters.
To investigate the impact of re-initialization methods on noisy data, we also
consider learning under label noise. Surprisingly, in this case,
re-initialization significantly improves upon standard training, even in the
presence of other carefully tuned regularization techniques.
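As a concrete illustration of the kind of method studied here, the sketch below periodically re-initializes the last layer(s) of a network during training. This is a generic re-initialization scheme written for clarity, not necessarily one of the exact protocols evaluated in the paper; the architecture, period, and optimizer settings are illustrative assumptions.
```python
import torch
import torch.nn as nn

# Minimal sketch: re-initialize the last Linear layer(s) every K epochs.
# Hypothetical MLP and hyperparameters; not the paper's exact protocol.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32 * 3, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

def reinit_last_layers(net: nn.Sequential, num_layers: int = 1) -> None:
    """Reset the parameters of the last `num_layers` Linear modules."""
    linears = [m for m in net if isinstance(m, nn.Linear)]
    for layer in linears[-num_layers:]:
        layer.reset_parameters()

REINIT_EVERY = 20  # hypothetical re-initialization period, in epochs
TOTAL_EPOCHS = 100
for epoch in range(TOTAL_EPOCHS):
    # ... one epoch of standard training over the data loader goes here ...
    if (epoch + 1) % REINIT_EVERY == 0 and epoch + 1 < TOTAL_EPOCHS:
        reinit_last_layers(model, num_layers=1)
        # Re-create the optimizer so stale momentum does not act on fresh weights.
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
```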
Related papers
- Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
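For background, the generalized Gauss-Newton step referenced above has the standard form below; the paper's contribution is showing that this update becomes exact and tractable in a class of deep reversible architectures, which the generic formula itself does not convey.
```latex
% Standard generalized Gauss-Newton step (background only; not the paper's
% tractable reversible-architecture form). J_i is the Jacobian of the network
% outputs w.r.t. the parameters \theta, H_{\ell,i} the Hessian of the loss
% w.r.t. the outputs, and \lambda a damping coefficient.
G(\theta) = \sum_i J_i^{\top} H_{\ell,i}\, J_i,
\qquad
\theta \;\leftarrow\; \theta - \bigl(G(\theta) + \lambda I\bigr)^{-1} \nabla_\theta \mathcal{L}(\theta)
```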
arXiv Detail & Related papers (2024-11-12T17:58:40Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
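The in-batch negatives part of this recipe can be illustrated with a generic contrastive loss, sketched below; the batch size, embedding dimension, and temperature are assumptions for illustration rather than details taken from the paper.
```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb: torch.Tensor,
                            passage_emb: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss where each query's positive is its own passage and the
    other passages in the batch serve as negatives (generic sketch)."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)   # positives on the diagonal
    return F.cross_entropy(scores, labels)

# Illustrative usage with random embeddings standing in for encoder outputs.
q = torch.randn(16, 768)
p = torch.randn(16, 768)
loss = in_batch_negatives_loss(q, p)
```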
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
- Understanding Overfitting in Adversarial Training via Kernel Regression [16.49123079820378]
Adversarial training and data augmentation with noise are widely adopted techniques to enhance the performance of neural networks.
This paper investigates adversarial training and data augmentation with noise in the context of regularized regression.
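As background on why noise augmentation behaves like regularization in the regression setting studied here, the classical identity below shows that isotropic Gaussian input noise added to a linear predictor yields a ridge-type penalty in expectation (a textbook result, not a finding of this paper).
```latex
% Expected squared error of a linear predictor w under input noise
% \varepsilon \sim \mathcal{N}(0, \sigma^2 I): noise augmentation equals an
% L2 penalty in expectation.
\mathbb{E}_{\varepsilon}\!\left[\bigl(y - (x + \varepsilon)^{\top} w\bigr)^{2}\right]
  \;=\; \bigl(y - x^{\top} w\bigr)^{2} \;+\; \sigma^{2}\,\lVert w \rVert_{2}^{2}
```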
arXiv Detail & Related papers (2023-04-13T08:06:25Z)
- Regularization-based Pruning of Irrelevant Weights in Deep Neural Architectures [0.0]
We propose a method for learning sparse neural topologies via a regularization technique that identifies irrelevant weights and selectively shrinks their norm.
We tested the proposed technique on different image classification and natural language generation tasks, obtaining results on par with or better than competitors in terms of sparsity and task metrics.
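A minimal sketch of selective norm shrinkage, assuming a simple magnitude threshold to flag candidate-irrelevant weights, is given below; the threshold, shrink factor, and schedule are illustrative and do not reproduce the paper's exact criterion.
```python
import torch

@torch.no_grad()
def shrink_small_weights(model: torch.nn.Module,
                         threshold: float = 1e-2,
                         shrink_factor: float = 0.9) -> None:
    """Multiply weights whose magnitude is below `threshold` by `shrink_factor`,
    selectively decaying candidate-irrelevant weights toward zero (generic sketch)."""
    for param in model.parameters():
        if param.dim() < 2:          # skip biases / normalization parameters
            continue
        small = param.abs() < threshold
        param[small] *= shrink_factor

# Typical usage: call once per epoch after the optimizer step, then prune
# (zero out and freeze) weights that have collapsed below a final cutoff.
```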
arXiv Detail & Related papers (2022-04-11T09:44:16Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Continual Deep Learning by Functional Regularisation of Memorable Past [95.97578574330934]
Continually learning new skills is important for intelligent systems, yet standard deep learning methods suffer from catastrophic forgetting of the past.
We propose a new functional-regularisation approach that utilises a few memorable past examples that are crucial to avoid forgetting.
Our method achieves state-of-the-art performance on standard benchmarks and opens a new direction for life-long learning where regularisation and memory-based methods are naturally combined.
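A generic sketch of function-space regularisation on a few stored examples is shown below: it penalizes drift of the current network's outputs from outputs recorded when the examples were memorised. The paper's method selects and weights memorable examples in a more principled way, so treat this as an illustrative baseline rather than the proposed algorithm.
```python
import torch
import torch.nn.functional as F

def functional_regulariser(model: torch.nn.Module,
                           memory_inputs: torch.Tensor,
                           memory_logits: torch.Tensor,
                           strength: float = 1.0) -> torch.Tensor:
    """Penalise deviation of current predictions from stored past predictions
    on a small memory of past examples (generic function-space regulariser)."""
    current_logits = model(memory_inputs)
    drift = F.mse_loss(current_logits, memory_logits)
    return strength * drift

# During training on a new task, the total loss would be:
#   loss = task_loss + functional_regulariser(model, mem_x, mem_logits)
# where (mem_x, mem_logits) were recorded at the end of the previous task.
```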
arXiv Detail & Related papers (2020-04-29T10:47:54Z)
- AL2: Progressive Activation Loss for Learning General Representations in Classification Neural Networks [12.14537824884951]
We propose a novel regularization method that progressively penalizes the magnitude of activations during training.
Our method's effect on generalization is analyzed with label randomization tests and cumulative ablations.
arXiv Detail & Related papers (2020-03-07T18:38:46Z)
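A minimal sketch of the kind of progressively weighted activation penalty described in the AL2 entry above is shown below, using forward hooks to collect activations and a coefficient that ramps up over training; the linear schedule and L2 penalty form are assumptions for illustration, not necessarily the paper's exact loss.
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Collect hidden activations with forward hooks (generic sketch).
activations = []
def save_activation(module, inputs, output):
    activations.append(output)

for module in model:
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(save_activation)

def activation_penalty(epoch: int, total_epochs: int):
    """L2 penalty on activations whose weight grows linearly over training
    (one possible 'progressive' schedule, assumed for illustration)."""
    coeff = 1e-4 * (epoch / total_epochs)
    return coeff * sum(a.pow(2).mean() for a in activations)

# In the training loop: clear `activations`, run the forward pass, then add
# `activation_penalty(epoch, total_epochs)` to the classification loss.
```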
This list is automatically generated from the titles and abstracts of the papers on this site.