Understanding the Generalization Benefits of Late Learning Rate Decay
- URL: http://arxiv.org/abs/2401.11600v1
- Date: Sun, 21 Jan 2024 21:11:09 GMT
- Title: Understanding the Generalization Benefits of Late Learning Rate Decay
- Authors: Yinuo Ren, Chao Ma, Lexing Ying
- Abstract summary: We examine the relation between training and testing loss in neural networks.
We introduce a nonlinear model whose loss landscapes mirror those observed for real neural networks.
We demonstrate that an extended phase with a large learning rate steers our model towards the minimum norm solution of the training loss.
- Score: 14.471831651042367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Why does training neural networks with large learning rates for a
longer time often lead to better generalization? In this paper, we delve into
this question
by examining the relation between training and testing loss in neural networks.
Through visualization of these losses, we note that the training trajectory
with a large learning rate navigates through the minima manifold of the
training loss, finally nearing the neighborhood of the testing loss minimum.
Motivated by these findings, we introduce a nonlinear model whose loss
landscapes mirror those observed for real neural networks. Upon investigating
the training process using SGD on our model, we demonstrate that an extended
phase with a large learning rate steers our model towards the minimum norm
solution of the training loss, which may achieve near-optimal generalization,
thereby affirming the empirically observed benefits of late learning rate
decay.
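The mechanism described above can be illustrated on a toy problem. The sketch below is not the paper's model: it uses an overparameterized linear least-squares problem (where the training loss has a manifold of global minima) with a hypothetical two-phase learning rate schedule, and the learning rate values and problem sizes are invented for illustration. For this linear problem, gradient descent from zero initialization provably converges to the minimum norm interpolating solution, the kind of solution the paper argues a long large-learning-rate phase steers nonlinear models toward.

```python
import numpy as np

def late_decay_lr(step, total_steps, lr_large=0.05, lr_small=0.005, decay_frac=0.9):
    """Piecewise-constant schedule: hold a large rate for most of training,
    then decay late (hypothetical values, not the paper's settings)."""
    return lr_large if step < decay_frac * total_steps else lr_small

# Toy overparameterized least-squares problem: more parameters (20) than
# data points (5), so the training loss has a manifold of global minima.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
A /= np.linalg.norm(A, 2)          # normalize so gradient descent is stable
b = rng.standard_normal(5)

x = np.zeros(20)                   # zero init keeps iterates in the row space of A
total_steps = 5000
for step in range(total_steps):
    grad = A.T @ (A @ x - b)       # gradient of 0.5 * ||A x - b||^2
    x -= late_decay_lr(step, total_steps) * grad

# Gradient descent from zero converges to the minimum norm solution here.
min_norm = np.linalg.pinv(A) @ b
print(np.linalg.norm(x - min_norm))
```

In this linear setting any convergent schedule reaches the same point; the paper's contribution is showing why, for a nonlinear model, the *large-rate phase* specifically is what biases SGD toward the minimum norm solution.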
Related papers
- Simplicity bias and optimization threshold in two-layer ReLU networks [24.43739371803548]
We show that despite overparametrization, networks converge toward simpler solutions rather than interpolating the training data.
Our analysis relies on the so called early alignment phase, during which neurons align towards specific directions.
arXiv Detail & Related papers (2024-10-03T09:58:57Z)
- Relearning Forgotten Knowledge: on Forgetting, Overfit and Training-Free Ensembles of DNNs [9.010643838773477]
We introduce a novel score for quantifying overfit, which monitors the forgetting rate of deep models on validation data.
We show that overfit can occur with and without a decrease in validation accuracy, and may be more common than previously appreciated.
We use our observations to construct a new ensemble method, based solely on the training history of a single network, which provides significant improvement without any additional cost in training time.
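A forgetting-based overfit score of the kind described above might be computed as follows. This is a hypothetical sketch, not the paper's actual score: it assumes a boolean history of per-example correctness on validation data across epochs and counts correct-to-incorrect flips.

```python
import numpy as np

def forgetting_rate(correct_history):
    """Hypothetical overfit score: given per-epoch, per-example correctness
    on validation data (epochs x examples), return the mean number of times
    an example flips from correct to incorrect."""
    h = np.asarray(correct_history, dtype=bool)
    flips = h[:-1] & ~h[1:]          # correct at epoch t, wrong at epoch t+1
    return flips.sum(axis=0).mean()  # mean forget events per example

history = [
    [True,  True,  False],   # epoch 0
    [True,  False, False],   # epoch 1: example 1 forgotten
    [False, False, True],    # epoch 2: example 0 forgotten, example 2 learned
]
print(forgetting_rate(history))      # 2 forget events over 3 examples
```

A rising score can flag overfit even while aggregate validation accuracy holds steady, matching the paper's observation that the two need not move together.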
arXiv Detail & Related papers (2023-10-17T09:22:22Z)
- Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
- Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations [51.552870594221865]
We show that last layer retraining can match or outperform state-of-the-art approaches on spurious correlation benchmarks.
We also show that last layer retraining on large ImageNet-trained models can significantly reduce reliance on background and texture information.
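The idea of last layer retraining can be sketched on synthetic data. Everything here is a hypothetical stand-in: a fixed random ReLU projection plays the role of the frozen pretrained backbone, the toy labels are generated from the features themselves, and the head is refit by least squares rather than the paper's actual procedure (which retrains on held-out data where the spurious correlation is broken).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a pretrained backbone: a fixed random projection
# plus ReLU yields "frozen" features that are never updated.
W_frozen = 0.1 * rng.standard_normal((32, 8))
def features(x):
    return np.maximum(x @ W_frozen, 0.0)

# Toy retraining set with labels that are a linear threshold of the features.
X = rng.standard_normal((200, 32))
w_true = rng.standard_normal(8)
y = (features(X) @ w_true > 0).astype(float)

# "Last layer retraining": fit only a linear head (with intercept) on the
# frozen features, here via least squares for simplicity.
F = np.hstack([features(X), np.ones((len(X), 1))])
head, *_ = np.linalg.lstsq(F, 2 * y - 1, rcond=None)
pred = (F @ head > 0).astype(float)
print((pred == y).mean())
```

The point of the exercise: only the 9 head parameters are fit, while the backbone stays fixed, which is why the method is cheap relative to full retraining.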
arXiv Detail & Related papers (2022-04-06T16:55:41Z)
- With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization [3.6321778403619285]
Generalization of deep neural networks remains one of the main open problems in machine learning.
Early layers generally learn representations relevant to performance on both training data and testing data.
Deeper layers only minimize training risks and fail to generalize well with testing or mislabeled data.
arXiv Detail & Related papers (2022-01-28T05:26:32Z)
- On the Robustness of Pretraining and Self-Supervision for a Deep Learning-based Analysis of Diabetic Retinopathy [70.71457102672545]
We compare the impact of different training procedures for diabetic retinopathy grading.
We investigate different aspects such as quantitative performance, statistics of the learned feature representations, interpretability and robustness to image distortions.
Our results indicate that models initialized with ImageNet pretraining show a significant increase in performance, generalization, and robustness to image distortions.
arXiv Detail & Related papers (2021-06-25T08:32:45Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- On the Generalization Properties of Adversarial Training [21.79888306754263]
This paper studies the generalization performance of a generic adversarial training algorithm.
A series of numerical studies are conducted to demonstrate how the smoothness and L1 penalization help improve the adversarial robustness of models.
arXiv Detail & Related papers (2020-08-15T02:32:09Z)
- Retrospective Loss: Looking Back to Improve Training of Deep Neural Networks [15.329684157845872]
We introduce a new retrospective loss to improve the training of deep neural network models.
Minimizing the retrospective loss, along with the task-specific loss, pushes the parameter state at the current training step towards the optimal parameter state.
Although the idea is simple, we analyze the method and conduct comprehensive sets of experiments across domains.
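One plausible form of such a loss can be sketched numerically. This is an assumption, not the paper's exact definition: the term below rewards predictions that are close to the target while being far from a past checkpoint's predictions, which matches the abstract's description of pushing the current state toward the optimum and away from earlier states.

```python
import numpy as np

def retrospective_loss(pred, target, past_pred, kappa=2.0):
    """Hypothetical retrospective term (the paper's definition may differ):
    penalize distance to the target, reward distance from the predictions
    of an earlier checkpoint, weighted by kappa."""
    return ((kappa + 1.0) * np.linalg.norm(pred - target)
            - kappa * np.linalg.norm(pred - past_pred))

target = np.array([1.0, 0.0])
past   = np.array([0.0, 1.0])    # where an earlier checkpoint predicted

near_target = np.array([0.9, 0.1])
near_past   = np.array([0.1, 0.9])

print(retrospective_loss(near_target, target, past))  # low: moved toward target
print(retrospective_loss(near_past, target, past))    # high: stuck near the past
```

In training, this term would be added to the task-specific loss, with the past checkpoint refreshed periodically so the pull away from it keeps pointing toward progress.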
arXiv Detail & Related papers (2020-06-24T10:16:36Z)
- The large learning rate phase of deep learning: the catapult mechanism [50.23041928811575]
We present a class of neural networks with solvable training dynamics.
We find good agreement between our model's predictions and training dynamics in realistic deep learning settings.
We believe our results shed light on characteristics of models trained at different learning rates.
arXiv Detail & Related papers (2020-03-04T17:52:48Z)
- Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.