The Two Regimes of Deep Network Training
- URL: http://arxiv.org/abs/2002.10376v1
- Date: Mon, 24 Feb 2020 17:08:24 GMT
- Title: The Two Regimes of Deep Network Training
- Authors: Guillaume Leclerc, Aleksander Madry
- Abstract summary: We study the effects of different learning rate schedules and the appropriate way to select them.
To this end, we isolate two distinct phases of training, which we refer to as the "large-step" regime and the "small-step" regime.
By specializing the training algorithm to each regime, we can significantly simplify learning rate schedules.
- Score: 93.84309968956941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The learning rate schedule has a major impact on the performance of deep learning
models. Still, the choice of a schedule is often heuristic. We aim to develop
a precise understanding of the effects of different learning rate schedules and
the appropriate way to select them. To this end, we isolate two distinct phases
of training: the first, which we refer to as the "large-step" regime, exhibits
rather poor performance from an optimization point of view but is the primary
contributor to model generalization; the second, the "small-step" regime, exhibits
much more "convex-like" optimization behavior but, used in isolation, produces
models that generalize poorly. We find that by treating these regimes
separately, and specializing our training algorithm to each one of them, we
can significantly simplify learning rate schedules.
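The abstract's idea lends itself to a very simple schedule: a long constant large-step phase followed by a short small-step phase. The sketch below is a minimal illustration of such a two-phase schedule, not the authors' exact training procedure; the learning rates (0.1, 1e-3) and the 80% switch point are placeholder assumptions.
```python
# Minimal sketch of a two-phase ("large-step" then "small-step") learning rate
# schedule. The specific values below are illustrative, not the paper's settings.

def two_phase_lr(step: int, total_steps: int,
                 large_lr: float = 0.1,      # assumed large-step learning rate
                 small_lr: float = 1e-3,     # assumed small-step learning rate
                 switch_frac: float = 0.8) -> float:
    """Constant large learning rate for most of training, then a constant
    small learning rate for the remainder."""
    if step < switch_frac * total_steps:
        return large_lr   # "large-step" regime: poor optimization, drives generalization
    return small_lr       # "small-step" regime: convex-like fine-tuning

# Example: learning rate at a few points of a 1000-step run.
if __name__ == "__main__":
    for s in (0, 500, 799, 800, 999):
        print(s, two_phase_lr(s, total_steps=1000))
```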
Related papers
- Understanding Optimization in Deep Learning with Central Flows [53.66160508990508]
We show that an optimizer's implicit behavior can be explicitly captured by a "central flow": a differential equation modeling the time-averaged optimization trajectory.
We show that these flows can empirically predict long-term optimization trajectories of generic neural networks.
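As a loose analogy only (the paper's central flows are derived for specific optimizers and capture time-averaged, oscillation-smoothed dynamics), the toy snippet below contrasts discrete gradient descent iterates with a continuous gradient-flow ODE on a quadratic; everything in it is an illustrative assumption, not the authors' construction.
```python
# Toy illustration (not the paper's central flow): compare discrete gradient
# descent with the gradient-flow ODE dw/dt = -grad(w) on a simple quadratic.
import numpy as np

def grad(w: np.ndarray) -> np.ndarray:
    # Gradient of the toy objective 0.5 * w^T diag(1, 10) w
    return np.array([1.0, 10.0]) * w

w_discrete = np.array([1.0, 1.0])
w_flow = np.array([1.0, 1.0])
lr = 0.15   # large enough that the second coordinate oscillates in discrete steps

for _ in range(100):
    # Discrete optimizer step (oscillates along the sharp coordinate).
    w_discrete = w_discrete - lr * grad(w_discrete)
    # Euler-integrate the flow with a much finer time step; its smooth
    # trajectory is the kind of averaged path a flow-based model describes.
    for _ in range(50):
        w_flow = w_flow - (lr / 50) * grad(w_flow)

print("discrete iterate:", w_discrete)
print("flow trajectory :", w_flow)
```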
arXiv Detail & Related papers (2024-10-31T17:58:13Z)
- Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives.
We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
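A heavily hedged sketch of the general idea of a normalized gradient difference between a retain objective and a forget objective; the exact combination rule and the adaptive learning rate in NGDiff may differ, so treat the formula below as an assumption for illustration.
```python
# Illustrative sketch of combining two task gradients via a normalized
# difference (assumed form; NGDiff's exact update rule may differ).
import numpy as np

def normalized_gradient_difference(g_retain: np.ndarray,
                                   g_forget: np.ndarray,
                                   eps: float = 1e-12) -> np.ndarray:
    """Normalize each objective's gradient so neither dominates. Using the
    result in a descent step descends the retain loss while ascending the
    forget loss."""
    g_r = g_retain / (np.linalg.norm(g_retain) + eps)
    g_f = g_forget / (np.linalg.norm(g_forget) + eps)
    return g_r - g_f

# Example usage with random placeholder gradients.
rng = np.random.default_rng(0)
params = rng.normal(size=10)
g = normalized_gradient_difference(rng.normal(size=10), rng.normal(size=10))
params -= 0.01 * g   # placeholder step size
```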
arXiv Detail & Related papers (2024-10-29T14:41:44Z)
- Narrowing the Focus: Learned Optimizers for Pretrained Models [24.685918556547055]
We propose a novel technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers.
When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional off-the-shelf methods such as Adam and existing general-purpose learned optimizers.
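A minimal sketch of the general recipe of combining update directions from a few hand-designed optimizers with learned per-layer weights; the base directions chosen here (plain gradient, sign descent, a momentum buffer) and the way the weights are stored are assumptions for illustration, not the paper's architecture.
```python
# Sketch: per-layer learned linear combination of base update directions
# (illustrative; the paper's base optimizers and meta-learning are not reproduced).
import numpy as np

def base_directions(grad: np.ndarray, momentum: np.ndarray):
    # Three simple hand-designed update directions.
    return [grad,             # plain gradient (SGD-like)
            np.sign(grad),    # sign descent (scale-invariant, Adam-flavored)
            momentum]         # momentum buffer

def combined_update(grad, momentum, layer_weights):
    dirs = base_directions(grad, momentum)
    return sum(w * d for w, d in zip(layer_weights, dirs))

# Example for one layer: the weights would normally be learned per layer.
rng = np.random.default_rng(0)
g = rng.normal(size=(4, 4))
m = 0.9 * rng.normal(size=(4, 4))
layer_weights = np.array([0.5, 0.3, 0.2])   # placeholder learned coefficients
params = rng.normal(size=(4, 4))
params -= 0.01 * combined_update(g, m, layer_weights)
```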
arXiv Detail & Related papers (2024-08-17T23:55:19Z)
- EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training [79.96741042766524]
We reformulate the training curriculum as a soft-selection function.
We show that gradually exposing the contents of natural images can be readily achieved by adjusting the intensity of data augmentation.
The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective.
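One way to read this is as a schedule on augmentation strength: weak augmentation early, so the network first sees the easier content of images, and full augmentation later. The linear ramp below is a guess at such a schedule, not EfficientTrain++'s actual soft-selection function.
```python
# Illustrative curriculum: ramp data-augmentation intensity with training
# progress (assumed linear ramp; the paper derives its schedule differently).

def augmentation_magnitude(step: int, total_steps: int,
                           min_mag: float = 0.1,
                           max_mag: float = 1.0) -> float:
    """Return an augmentation strength in [min_mag, max_mag] that grows
    linearly from the start to the end of training."""
    progress = min(step / max(total_steps, 1), 1.0)
    return min_mag + (max_mag - min_mag) * progress

# Example: the magnitude could scale color jitter, RandAugment strength, etc.
for s in (0, 2500, 5000, 7500, 10000):
    print(s, round(augmentation_magnitude(s, total_steps=10000), 2))
```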
arXiv Detail & Related papers (2024-05-14T17:00:43Z)
- Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
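For reference, a plain linear-decay schedule of the kind the paper analyzes can be written in a few lines; the peak learning rate and decay-to-zero endpoint below are placeholder assumptions, and the paper's refined, problem-adaptive variants are not reproduced.
```python
# Minimal linear-decay learning rate schedule (illustrative values).

def linear_decay_lr(step: int, total_steps: int,
                    peak_lr: float = 0.1, final_lr: float = 0.0) -> float:
    """Linearly interpolate from peak_lr at step 0 to final_lr at the end."""
    progress = min(step / max(total_steps, 1), 1.0)
    return peak_lr + (final_lr - peak_lr) * progress

for s in (0, 250, 500, 750, 1000):
    print(s, linear_decay_lr(s, total_steps=1000))
```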
arXiv Detail & Related papers (2023-10-11T19:16:35Z)
- Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
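A minimal sketch of zero-shot structured pruning in the sense of dropping whole convolutional filters by a simple magnitude criterion, with no fine-tuning; the L1-norm ranking and the keep ratio are assumptions, not necessarily the criterion used in the paper.
```python
# Sketch: structured pruning of a conv layer's filters by L1 norm
# (assumed criterion; the keep ratio is a placeholder).
import numpy as np

def prune_filters(weight: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """weight has shape (out_channels, in_channels, kh, kw). Keep the
    out_channels with the largest L1 norm and drop the rest."""
    scores = np.abs(weight).sum(axis=(1, 2, 3))        # one score per filter
    n_keep = max(1, int(round(keep_ratio * weight.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])       # indices to retain
    return weight[keep]

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32, 3, 3))             # a pretrained conv layer's weights
print(prune_filters(w, keep_ratio=0.5).shape)   # -> (32, 32, 3, 3)
```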
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
- Joint Training of Deep Ensembles Fails Due to Learner Collusion [61.557412796012535]
Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model.
Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance.
Surprisingly, directly minimizing the loss of the ensemble appears to be rarely applied in practice; we show that such joint training fails due to learner collusion.
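The distinction at stake can be made concrete in a few lines: "joint training" minimizes the loss of the averaged prediction, whereas the traditional recipe averages the individual losses. The toy comparison below (squared error, random predictions) is purely illustrative and makes no claim about the paper's experimental setup.
```python
# Toy contrast between the ensemble's loss and the mean of individual losses.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=100)
preds = target + rng.normal(scale=1.0, size=(5, 100))   # 5 noisy base learners

loss_of_ensemble = np.mean((preds.mean(axis=0) - target) ** 2)   # joint objective
mean_of_losses = np.mean((preds - target) ** 2)                  # traditional objective

# By Jensen's inequality the averaged prediction has lower (or equal) loss.
print(loss_of_ensemble, mean_of_losses)
```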
arXiv Detail & Related papers (2023-01-26T18:58:07Z)
- Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima [40.70374106466073]
We propose a generic learning rate schedule plugin called LEArning Rate Perturbation (LEAP)
LEAP can be applied to various learning rate schedules to improve the model training by introducing a certain perturbation to the learning rate.
We conduct extensive experiments which show that training with LEAP can improve the performance of various deep learning models on diverse datasets.
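The plugin nature of the idea suggests a wrapper around any existing schedule that multiplies the learning rate by a small random perturbation; the Gaussian noise and its scale below are assumptions, not the perturbation distribution used in the paper.
```python
# Sketch of perturbing an arbitrary learning rate schedule
# (assumed multiplicative Gaussian noise; LEAP's exact distribution may differ).
import random

def perturbed_lr(base_schedule, step: int, noise_scale: float = 0.1,
                 rng: random.Random = random.Random(0)) -> float:
    """Wrap any schedule `base_schedule(step) -> lr` with a small random
    perturbation, keeping the result positive."""
    lr = base_schedule(step)
    factor = max(1.0 + rng.gauss(0.0, noise_scale), 0.1)
    return lr * factor

# Example with a simple step-decay base schedule.
base = lambda s: 0.1 if s < 500 else 0.01
print([round(perturbed_lr(base, s), 4) for s in (0, 100, 600)])
```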
arXiv Detail & Related papers (2022-08-25T05:05:18Z)
- Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation [8.340191147575307]
We introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-regressive formulation.
It flexibly adjusts to abrupt changes of behaviours induced by new learning rate values.
It is well-suited to tackle a set of problems: first, for the on-line adaptation of the learning rate for a cold-started run; then, for tuning the schedule for a set of similar tasks, as well as warm-starting it for a new task.
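As a rough, generic sketch only: the snippet fits a Gaussian process to (learning rate, observed loss) pairs and picks the next learning rate with a lower-confidence-bound rule. It uses scikit-learn's stock GP rather than the paper's latent, auto-regressive trace model, and the candidate grid, kernel, and observations are placeholder assumptions.
```python
# Generic Bayesian-optimisation sketch for picking the next learning rate
# (scikit-learn GP stand-in; not the paper's latent GP model of full traces).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Placeholder observations: log10 learning rates tried so far and final losses.
observed_log_lr = np.array([[-4.0], [-3.0], [-2.0], [-1.0]])
observed_loss = np.array([1.9, 1.2, 0.9, 2.5])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(observed_log_lr, observed_loss)

# Score a grid of candidate learning rates with a lower confidence bound.
candidates = np.linspace(-5.0, 0.0, 101).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)
best = candidates[np.argmin(mean - 1.0 * std)]
print("next learning rate to try:", 10 ** best[0])
```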
arXiv Detail & Related papers (2020-06-25T13:18:18Z)
- Auto-Ensemble: An Adaptive Learning Rate Scheduling based Deep Learning Model Ensembling [11.324407834445422]
This paper proposes Auto-Ensemble (AE) to collect checkpoints of deep learning model and ensemble them automatically.
The advantage of this method is that it makes the model converge to various local optima by scheduling the learning rate within a single training run.
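The mechanics can be sketched as a cyclical learning rate with a checkpoint captured at the end of each cycle, whose predictions are then averaged; the cosine cycles and the toy linear model below are assumptions for illustration, not AE's actual scheduling rule.
```python
# Sketch of checkpoint ensembling driven by a cyclical learning rate
# (illustrative cosine cycles and toy least-squares model; not AE's exact procedure).
import math
import numpy as np

def cyclic_lr(step: int, cycle_len: int = 100, lr_max: float = 0.1,
              lr_min: float = 1e-3) -> float:
    """Cosine-annealed learning rate that restarts every `cycle_len` steps."""
    t = (step % cycle_len) / cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Toy training loop; snapshot weights at each cycle boundary, then average predictions.
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(200, 5)), rng.normal(size=5)
y = X @ w_true
w = np.zeros(5)
snapshots = []
for step in range(500):
    gradient = 2 * X.T @ (X @ w - y) / len(y)
    w -= cyclic_lr(step) * gradient
    if (step + 1) % 100 == 0:
        snapshots.append(w.copy())        # one "local optimum" per cycle

ensemble_pred = np.mean([X @ s for s in snapshots], axis=0)
print("ensemble mse:", np.mean((ensemble_pred - y) ** 2))
```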
arXiv Detail & Related papers (2020-03-25T08:17:31Z)