Learning Rate Schedules in the Presence of Distribution Shift
- URL: http://arxiv.org/abs/2303.15634v2
- Date: Sun, 20 Aug 2023 15:57:07 GMT
- Title: Learning Rate Schedules in the Presence of Distribution Shift
- Authors: Matthew Fahrbach, Adel Javanmard, Vahab Mirrokni, Pratik Worah
- Abstract summary: We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution.
We provide experiments for high-dimensional regression models and neural networks to illustrate these schedules and their cumulative regret.
- Score: 18.310336156637774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We design learning rate schedules that minimize regret for SGD-based online
learning in the presence of a changing data distribution. We fully characterize
the optimal learning rate schedule for online linear regression via a novel
analysis with stochastic differential equations. For general convex loss
functions, we propose new learning rate schedules that are robust to
distribution shift and we give upper and lower bounds for the regret that only
differ by constants. For non-convex loss functions, we define a notion of
regret based on the gradient norm of the estimated models and propose a
learning schedule that minimizes an upper bound on the total expected regret.
Intuitively, one expects changing loss landscapes to require more exploration,
and we confirm that optimal learning rate schedules typically increase in the
presence of distribution shift. Finally, we provide experiments for
high-dimensional regression models and neural networks to illustrate these
learning rate schedules and their cumulative regret.
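To make the setting concrete, below is a minimal Python sketch of SGD-based online linear regression in which the ground-truth parameter drifts between rounds. The random-walk drift model, the two candidate schedules, and the excess-risk proxy used as regret are illustrative assumptions, not the schedules or the regret functional derived in the paper; the sketch only shows why keeping the learning rate elevated can help when the data distribution keeps moving.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 5000        # dimension and number of online rounds
sigma_shift = 0.02     # assumed per-round random-walk drift of the true parameter
noise = 0.1            # observation noise level

def cumulative_regret(schedule):
    """Run online SGD for linear regression against a drifting target.

    Regret is approximated here by the summed excess risk of the iterate
    relative to the current (unknown) optimum -- a simplified proxy, not the
    exact regret functional analyzed in the paper.
    """
    theta_star = rng.normal(size=d) / np.sqrt(d)   # time-varying ground truth
    w = np.zeros(d)                                # SGD iterate
    regret = 0.0
    for t in range(1, T + 1):
        x = rng.normal(size=d) / np.sqrt(d)        # feature vector with E[|x|^2] = 1
        y = x @ theta_star + noise * rng.normal()
        w -= schedule(t) * (w @ x - y) * x         # one SGD step on the squared loss
        regret += np.dot(w - theta_star, w - theta_star) / d   # excess-risk proxy
        theta_star += sigma_shift * rng.normal(size=d) / np.sqrt(d)  # distribution shift
    return regret

decaying = lambda t: 1.0 / np.sqrt(t)              # classic schedule for a fixed distribution
elevated = lambda t: max(1.0 / np.sqrt(t), 0.2)    # kept bounded away from zero to track the drift

print("cumulative regret, decaying schedule:", cumulative_regret(decaying))
print("cumulative regret, elevated schedule:", cumulative_regret(elevated))
```

Under this toy drift model, the schedule that stays bounded away from zero typically accumulates less regret than the purely decaying one, mirroring the intuition above that changing loss landscapes call for more exploration and hence larger learning rates.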
Related papers
- On Regularization via Early Stopping for Least Squares Regression [4.159762735751163]
We prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules.
We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate (a generic early-stopping sketch appears after this list).
arXiv Detail & Related papers (2024-06-06T18:10:51Z) - Investigating the Histogram Loss in Regression [16.83443393563771]
Histogram Loss is a regression approach to learning the conditional distribution of a target variable.
We show that the benefits of learning distributions in this setup come from improvements in optimization rather than modelling extra information.
arXiv Detail & Related papers (2024-02-20T23:29:41Z) - Future Gradient Descent for Adapting the Temporal Shifting Data
Distribution in Online Recommendation Systems [30.88268793277078]
We learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training.
Compared with Batch Update, our theory suggests that the proposed algorithm achieves smaller temporal domain generalization error.
arXiv Detail & Related papers (2022-09-02T15:55:31Z) - Near-optimal Offline Reinforcement Learning with Linear Representation:
Leveraging Variance Information with Pessimism [65.46524775457928]
offline reinforcement learning seeks to utilize offline/historical data to optimize sequential decision-making strategies.
We study the statistical limits of offline reinforcement learning with linear model representations.
arXiv Detail & Related papers (2022-03-11T09:00:12Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient
for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - Reducing Representation Drift in Online Continual Learning [87.71558506591937]
We study the online continual learning paradigm, where agents must learn from a changing distribution with constrained memory and compute.
In this work we instead focus on the change in representations of previously observed data due to the introduction of previously unobserved class samples in the incoming data stream.
arXiv Detail & Related papers (2021-04-11T15:19:30Z) - A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.