Dropout Regularization Versus $\ell_2$-Penalization in the Linear Model
- URL: http://arxiv.org/abs/2306.10529v2
- Date: Thu, 25 Apr 2024 13:53:09 GMT
- Title: Dropout Regularization Versus $\ell_2$-Penalization in the Linear Model
- Authors: Gabriel Clara, Sophie Langer, Johannes Schmidt-Hieber
- Abstract summary: We study the statistical behavior of gradient descent iterates with dropout in the linear regression model.
We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout.
- Score: 7.032245866317619
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We investigate the statistical behavior of gradient descent iterates with dropout in the linear regression model. In particular, non-asymptotic bounds for the convergence of expectations and covariance matrices of the iterates are derived. The results shed more light on the widely cited connection between dropout and $\ell_2$-regularization in the linear model. We indicate a more subtle relationship, owing to interactions between the gradient descent dynamics and the additional randomness induced by dropout. Further, we study a simplified variant of dropout which does not have a regularizing effect and converges to the least squares estimator.
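To make the widely cited connection concrete: for an i.i.d. Bernoulli($p$) dropout mask $D = \mathrm{diag}(\delta_1,\dots,\delta_d)$, the classical marginalization argument gives $\mathbb{E}_D\|y - XD\beta\|^2 = \|y - pX\beta\|^2 + p(1-p)\,\beta^\top \mathrm{diag}(X^\top X)\,\beta$, a Tikhonov-type ($\ell_2$-type) penalty. The sketch below is a rough simulation (not the paper's analysis): it runs gradient descent with a fresh dropout mask at every step and compares the averaged iterate with the closed-form minimizer of this marginalized loss. The step size, iteration count, Gaussian design, and the absence of $1/p$ rescaling are illustrative assumptions.

```python
# Minimal simulation sketch: dropout gradient descent in a linear model versus
# the Tikhonov-type minimizer of the marginalized dropout loss. Illustrative
# assumptions: constant step size, no 1/p rescaling, Gaussian design.
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 500, 20, 0.8                       # samples, dimension, retention probability
X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d)
y = X @ beta_star + 0.5 * rng.standard_normal(n)

gram = X.T @ X
# Minimizer of E_D ||y - X D beta||^2: solves (p X'X + (1-p) diag(X'X)) beta = X'y.
beta_marg = np.linalg.solve(p * gram + (1 - p) * np.diag(np.diag(gram)), X.T @ y)
beta_ols = np.linalg.solve(gram, X.T @ y)    # ordinary least squares, for comparison

beta = np.zeros(d)
avg = np.zeros(d)
lr, steps = 1e-4, 40_000
for k in range(steps):
    mask = rng.binomial(1, p, size=d)        # fresh dropout mask D = diag(delta)
    resid = X @ (mask * beta) - y
    beta -= lr * mask * (X.T @ resid)        # gradient of 0.5 * ||y - X D beta||^2
    if k >= steps // 2:                      # average late iterates to damp dropout noise
        avg += beta / (steps - steps // 2)

print("averaged iterate vs marginalized-loss minimizer:", np.linalg.norm(avg - beta_marg))
print("averaged iterate vs least squares:", np.linalg.norm(avg - beta_ols))
```

Whether, and how fast, the actual gradient descent iterates track this marginalized minimizer, and what their covariance looks like, is the kind of question the non-asymptotic bounds in the paper address.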
Related papers
- A U-turn on Double Descent: Rethinking Parameter Counting in Statistical
Learning [68.76846801719095]
We investigate exactly when and where double descent appears, and show that its location is not inherently tied to the threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Convergence guarantees for forward gradient descent in the linear regression model [5.448070998907116]
We study the biologically motivated (weight-perturbed) forward gradient scheme that is based on a random linear combination of the gradient (a rough sketch of this scheme appears after this list).
We prove that the mean squared error of this method converges for $k \gtrsim d^2 \log(d)$ with rate $d^2\log(d)/k$. Compared to the dimension dependence $d$ for gradient descent, an additional factor $d\log(d)$ occurs.
arXiv Detail & Related papers (2023-09-26T15:15:10Z) - Dynamical chaos in nonlinear Schrödinger models with subquadratic
power nonlinearity [137.6408511310322]
We deal with a class of nonlinear Schrödinger lattices with random potential and subquadratic power nonlinearity.
We show that the spreading process is subdiffusive and has complex microscopic organization.
The limit of quadratic power nonlinearity is also discussed and shown to result in a delocalization border.
arXiv Detail & Related papers (2023-01-20T16:45:36Z) - Gradient flow in the gaussian covariate model: exact solution of
learning curves and multiple descent structures [14.578025146641806]
We provide a full and unified analysis of the whole time-evolution of the generalization curve.
We show that our theoretical predictions adequately match the learning curves obtained by gradient descent over realistic datasets.
arXiv Detail & Related papers (2022-12-13T17:39:18Z) - A Unified Analysis of Multi-task Functional Linear Regression Models
with Manifold Constraint and Composite Quadratic Penalty [0.0]
The power of multi-task learning is brought in by imposing additional structures over the slope functions.
We show the composite penalty induces a specific norm, which helps to quantify the manifold curvature.
A unified convergence upper bound is obtained and specifically applied to the reduced-rank model and the graph Laplacian regularized model.
arXiv Detail & Related papers (2022-11-09T13:32:23Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z) - Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z) - Lower Bounds on the Generalization Error of Nonlinear Learning Models [2.1030878979833467]
We study in this paper lower bounds for the generalization error of models derived from multi-layer neural networks, in the regime where the size of the layers is commensurate with the number of samples in the training data.
We show that unbiased estimators have unacceptable performance for such nonlinear networks in this regime.
We derive explicit generalization lower bounds for general biased estimators, in the cases of linear regression and of two-layered networks.
arXiv Detail & Related papers (2021-03-26T20:37:54Z) - Understanding Implicit Regularization in Over-Parameterized Single Index
Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
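Returning to the forward gradient entry above: the weight-perturbed forward gradient scheme can be sketched for linear regression as follows. Only a directional derivative of the loss along a random Gaussian direction $\xi$ is needed; scaling $\xi$ by that derivative yields $\langle \nabla L(\theta), \xi\rangle\,\xi$, an unbiased (though noisier) surrogate for the gradient. The design, noise level, step size, and iteration count below are illustrative assumptions, not the paper's exact setup.

```python
# Rough sketch of weight-perturbed forward gradient descent in linear regression.
# Illustrative assumptions: constant step size, Gaussian design, small label noise.
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 10
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)

theta = np.zeros(d)
lr, steps = 0.01, 20_000
for _ in range(steps):
    xi = rng.standard_normal(d)              # random perturbation direction
    resid = X @ theta - y
    dir_deriv = resid @ (X @ xi) / n         # <grad L(theta), xi>, a forward-mode quantity
    theta -= lr * dir_deriv * xi             # E[dir_deriv * xi] = grad L(theta)

print("mean squared parameter error:", np.mean((theta - theta_star) ** 2))
```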
This list is automatically generated from the titles and abstracts of the papers on this site.