Related papers: On Regularization via Early Stopping for Least Squares Regression

URL: http://arxiv.org/abs/2406.04425v1
Date: Thu, 6 Jun 2024 18:10:51 GMT
Title: On Regularization via Early Stopping for Least Squares Regression
Authors: Rishi Sonthalia, Jackie Lok, Elizaveta Rebrova
Abstract summary: We prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
Score: 4.159762735751163
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A fundamental problem in machine learning is understanding the effect of early stopping on the parameters obtained and the generalization capabilities of the model. Even for linear models, the effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full batch gradient descent for linear regression. With minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $\eta_k$, and a finite time horizon $T$, the early stopped solution $\beta_T$ is equivalent to the minimum norm solution for a generalized ridge regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
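
The equivalence described in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's own formulation: it assumes the loss $\frac{1}{2}\|X\beta - y\|^2$, zero initialization, and a full-rank $X^\top X$, and all function names are hypothetical. Under those assumptions, the iterate $\beta_T$ after $T$ steps with schedule $\eta_k$ matches a spectral filter, which in turn matches a generalized ridge solution whose penalty matrix shares eigenvectors with $X^\top X$.

```python
import numpy as np


def gd_early_stopped(X, y, etas):
    """Full-batch gradient descent on 0.5 * ||X b - y||^2, started at b = 0.

    `etas` is the learning rate schedule; its length is the stopping time T.
    """
    beta = np.zeros(X.shape[1])
    for eta in etas:
        beta = beta - eta * (X.T @ (X @ beta - y))
    return beta


def residual_factors(X, etas):
    """Eigenpairs of X^T X and the factors p_i = prod_k (1 - eta_k * lam_i)."""
    lam, V = np.linalg.eigh(X.T @ X)
    p = np.prod(1.0 - np.outer(etas, lam), axis=0)
    return lam, V, p


def spectral_early_stopped(X, y, etas):
    """Closed form of the same iterate: filter (1 - p_i) / lam_i per direction."""
    lam, V, p = residual_factors(X, etas)
    return V @ (((1.0 - p) / lam) * (V.T @ (X.T @ y)))


def generalized_ridge(X, y, etas):
    """Solve (X^T X + Lambda) b = X^T y with Lambda = V diag(mu) V^T.

    mu_i = lam_i * p_i / (1 - p_i) is chosen so that
    1 / (lam_i + mu_i) equals the early-stopping filter (1 - p_i) / lam_i.
    """
    lam, V, p = residual_factors(X, etas)
    mu = lam * p / (1.0 - p)
    penalty = V @ np.diag(mu) @ V.T
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))  # n > d, so X^T X is full rank here
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

    # Decaying schedule (an arbitrary choice), kept below 1 / lam_max
    lam_max = np.linalg.eigvalsh(X.T @ X).max()
    etas = 0.5 / lam_max / (1.0 + 0.1 * np.arange(20))

    b_gd = gd_early_stopped(X, y, etas)
    print(np.allclose(b_gd, spectral_early_stopped(X, y, etas)))
    print(np.allclose(b_gd, generalized_ridge(X, y, etas)))
```

Keeping every step size below $1/\lambda_{\max}$ keeps each factor $p_i$ in $(0, 1)$, so the implied penalties $\mu_i$ are positive. Directions with small eigenvalues have $p_i$ close to 1 and are therefore penalized most heavily relative to their eigenvalue.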

Related papers

A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression using a constant learning rate. We provide a simplified analysis recovering the same bias and variance bounds provided in Zou et al. (2021), based on simple linear algebra tools. We believe our work makes the analysis of gradient descent on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
arXiv Detail & Related papers (2025-06-18T15:10:38Z)
A Statistical Theory of Regularization-Based Continual Learning [10.899175512941053]
We provide a statistical analysis of regularization-based continual learning on a sequence of linear regression tasks. We first derive the convergence rate for the oracle estimator obtained as if all data were available simultaneously. A byproduct of our theoretical analysis is the equivalence between early stopping and generalized $\ell_2$-regularization.
arXiv Detail & Related papers (2024-06-10T12:25:13Z)
Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models. We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
Theoretical Characterization of the Generalization Performance of Overfitted Meta-Learning [70.52689048213398]
This paper studies the performance of overfitted meta-learning under a linear regression model with Gaussian features. We find new and interesting properties that do not exist in single-task linear regression. Our analysis suggests that benign overfitting is more significant and easier to observe when the noise and the diversity/fluctuation of the ground truth of each training task are large.
arXiv Detail & Related papers (2023-04-09T20:36:13Z)
Learning Rate Schedules in the Presence of Distribution Shift [18.310336156637774]
We design learning rate schedules that minimize regret for online learning in the presence of a changing data distribution. We provide experiments for high-dimensional regression models to illustrate these schedules and their regret.
arXiv Detail & Related papers (2023-03-27T23:29:02Z)
Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks. We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression [41.48538038768993]
We focus on the problem of kernel ridge regression for dot-product kernels. We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales.
arXiv Detail & Related papers (2022-05-30T04:21:31Z)
Time varying regression with hidden linear dynamics [74.9914602730208]
We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates.
arXiv Detail & Related papers (2021-12-29T23:37:06Z)
Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation. We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation. Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
Sparse approximation in learning via neural ODEs [0.0]
We study the impact of the final time horizon $T$ in training. In practical terms, a shorter time-horizon in the training problem can be interpreted as considering a shallower residual neural network.
arXiv Detail & Related papers (2021-02-26T16:23:02Z)
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
Fundamental Limits of Ridge-Regularized Empirical Risk Minimization in High Dimensions [41.7567932118769]
Empirical Risk Minimization algorithms are widely used in a variety of estimation and prediction tasks. In this paper, we characterize for the first time the fundamental limits on the statistical accuracy of convex ERM for inference.
arXiv Detail & Related papers (2020-06-16T04:27:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.