Learning Rate Schedules in the Presence of Distribution Shift
- URL: http://arxiv.org/abs/2303.15634v2
- Date: Sun, 20 Aug 2023 15:57:07 GMT
- Title: Learning Rate Schedules in the Presence of Distribution Shift
- Authors: Matthew Fahrbach, Adel Javanmard, Vahab Mirrokni, Pratik Worah
- Abstract summary: We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution.
We provide experiments for high-dimensional regression models and neural networks to illustrate these schedules and their cumulative regret.
- Score: 18.310336156637774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We design learning rate schedules that minimize regret for SGD-based online
learning in the presence of a changing data distribution. We fully characterize
the optimal learning rate schedule for online linear regression via a novel
analysis with stochastic differential equations. For general convex loss
functions, we propose new learning rate schedules that are robust to
distribution shift and we give upper and lower bounds for the regret that only
differ by constants. For non-convex loss functions, we define a notion of
regret based on the gradient norm of the estimated models and propose a
learning schedule that minimizes an upper bound on the total expected regret.
Intuitively, one expects changing loss landscapes to require more exploration,
and we confirm that optimal learning rate schedules typically increase in the
presence of distribution shift. Finally, we provide experiments for
high-dimensional regression models and neural networks to illustrate these
learning rate schedules and their cumulative regret.
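To make the setting concrete, below is a minimal Python sketch of SGD-based online linear regression in which the ground-truth parameter drifts between rounds. The random-walk drift model, the two candidate schedules, and the excess-risk proxy used as regret are illustrative assumptions, not the schedules or the regret functional derived in the paper; the sketch only shows why keeping the learning rate elevated can help when the data distribution keeps moving.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 5000        # dimension and number of online rounds
sigma_shift = 0.02     # assumed per-round random-walk drift of the true parameter
noise = 0.1            # observation noise level

def cumulative_regret(schedule):
    """Run online SGD for linear regression against a drifting target.

    Regret is approximated here by the summed excess risk of the iterate
    relative to the current (unknown) optimum -- a simplified proxy, not the
    exact regret functional analyzed in the paper.
    """
    theta_star = rng.normal(size=d) / np.sqrt(d)   # time-varying ground truth
    w = np.zeros(d)                                # SGD iterate
    regret = 0.0
    for t in range(1, T + 1):
        x = rng.normal(size=d) / np.sqrt(d)        # feature vector with E[|x|^2] = 1
        y = x @ theta_star + noise * rng.normal()
        w -= schedule(t) * (w @ x - y) * x         # one SGD step on the squared loss
        regret += np.dot(w - theta_star, w - theta_star) / d   # excess-risk proxy
        theta_star += sigma_shift * rng.normal(size=d) / np.sqrt(d)  # distribution shift
    return regret

decaying = lambda t: 1.0 / np.sqrt(t)              # classic schedule for a fixed distribution
elevated = lambda t: max(1.0 / np.sqrt(t), 0.2)    # kept bounded away from zero to track the drift

print("cumulative regret, decaying schedule:", cumulative_regret(decaying))
print("cumulative regret, elevated schedule:", cumulative_regret(elevated))
```

Under this toy drift model, the schedule that stays bounded away from zero typically accumulates less regret than the purely decaying one, mirroring the intuition above that changing loss landscapes call for more exploration and hence larger learning rates.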
Related papers
- On Regularization via Early Stopping for Least Squares Regression [4.159762735751163]
We prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules.
We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate (a generic early-stopping sketch appears after this list).
arXiv Detail & Related papers (2024-06-06T18:10:51Z) - Investigating the Histogram Loss in Regression [16.83443393563771]
Histogram Loss is a regression approach to learning the conditional distribution of a target variable.
We show that the benefits of learning distributions in this setup come from improvements in optimization rather than modelling extra information.
arXiv Detail & Related papers (2024-02-20T23:29:41Z) - Future Gradient Descent for Adapting the Temporal Shifting Data
Distribution in Online Recommendation Systems [30.88268793277078]
We learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training.
Compared with Batch Update, our theory suggests that the proposed algorithm achieves smaller temporal domain generalization error.
arXiv Detail & Related papers (2022-09-02T15:55:31Z) - Near-optimal Offline Reinforcement Learning with Linear Representation:
Leveraging Variance Information with Pessimism [65.46524775457928]
offline reinforcement learning seeks to utilize offline/historical data to optimize sequential decision-making strategies.
We study the statistical limits of offline reinforcement learning with linear model representations.
arXiv Detail & Related papers (2022-03-11T09:00:12Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient
for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - Reducing Representation Drift in Online Continual Learning [87.71558506591937]
We study the online continual learning paradigm, where agents must learn from a changing distribution with constrained memory and compute.
In this work we instead focus on the change in representations of previously observed data due to the introduction of previously unobserved class samples in the incoming data stream.
arXiv Detail & Related papers (2021-04-11T15:19:30Z) - A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.