A finite sample analysis of the benign overfitting phenomenon for ridge
function estimation
- URL: http://arxiv.org/abs/2007.12882v5
- Date: Fri, 12 Jan 2024 18:04:59 GMT
- Title: A finite sample analysis of the benign overfitting phenomenon for ridge
function estimation
- Authors: Emmanuel Caron and Stephane Chretien
- Abstract summary: We propose a finite sample analysis of non-linear models of ridge type.
We investigate the overparametrised regime of the double descent phenomenon for both the estimation problem and the prediction problem.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent extensive numerical experiments in large-scale machine learning have
uncovered a quite counterintuitive phase transition as a function of the
ratio between the sample size and the number of parameters in the model. As
the number of parameters $p$ approaches the sample size $n$, the generalisation
error increases, but, surprisingly, it starts decreasing again past the
threshold $p=n$. This phenomenon, brought to the theoretical community's
attention in \cite{belkin2019reconciling}, has been thoroughly investigated
lately, mostly for models simpler than deep neural networks, such as the
linear model with the parameter taken to be the minimum-norm solution of the
least-squares problem, first in the asymptotic regime where $p$ and $n$
tend to infinity, see e.g. \cite{hastie2019surprises}, and more recently in the
finite-dimensional regime, specifically for linear models
\cite{bartlett2020benign}, \cite{tsigler2020benign},
\cite{lecue2022geometrical}. In the present paper, we propose a finite sample
analysis of non-linear models of \textit{ridge} type, where we investigate the
\textit{overparametrised regime} of the double descent phenomenon for both the
\textit{estimation problem} and the \textit{prediction} problem. Our results
provide a precise analysis of the distance of the best estimator from the true
parameter as well as a generalisation bound which complements recent works of
\cite{bartlett2020benign} and \cite{chinot2020benign}. Our analysis is based on
tools closely related to the continuous Newton method
\cite{neuberger2007continuous} and a refined quantitative analysis of the
performance in prediction of the minimum $\ell_2$-norm solution.
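To make the object of study concrete: a \textit{ridge} function is a map of the form $x \mapsto g(\langle \theta, x \rangle)$ for a univariate link $g$, and the estimator analysed in the overparametrised regime is the minimum $\ell_2$-norm interpolator. The following minimal numpy sketch (not the authors' code) computes that interpolator in the linear case $g(t)=t$ and runs a toy double-descent experiment; the sample size, noise level, and Gaussian design are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_norm_least_squares(X, y):
    # For p > n the least-squares problem has infinitely many solutions;
    # the Moore-Penrose pseudoinverse returns the one of minimum l2-norm,
    #   theta_hat = X^T (X X^T)^{-1} y   (when X has full row rank),
    # i.e. the interpolator studied in the benign-overfitting literature.
    return np.linalg.pinv(X) @ y

# Toy double-descent experiment (illustrative parameters, not from the paper).
n, p_signal, sigma = 50, 10, 0.5          # sample size, signal dimension, noise level
theta_star = rng.normal(size=p_signal)    # true parameter on the first p_signal coordinates

def test_mse(p, n_test=2000):
    X = rng.normal(size=(n, p))                       # Gaussian design with p features
    y = X[:, :p_signal] @ theta_star + sigma * rng.normal(size=n)
    theta_hat = min_norm_least_squares(X, y)
    X_test = rng.normal(size=(n_test, p))
    y_test = X_test[:, :p_signal] @ theta_star        # noiseless targets for the test error
    return np.mean((X_test @ theta_hat - y_test) ** 2)

# The test error typically peaks near the interpolation threshold p = n
# and decreases again in the overparametrised regime p >> n.
for p in (10, 25, 45, 50, 55, 100, 500, 2000):
    print(f"p = {p:4d}   test MSE = {test_mse(p):.3f}")
```

With these arbitrary settings the error spikes near $p=n$ and falls again as $p$ grows, reproducing the qualitative double-descent shape; whether the overparametrised error is actually benign depends on covariance conditions of the kind analysed in \cite{bartlett2020benign}.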
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Global Convergence of Over-parameterized Deep Equilibrium Models [52.65330015267245]
A deep equilibrium model (DEQ) is implicitly defined through an equilibrium point of an infinite-depth weight-tied model with an input-injection.
Instead of infinite computations, it solves an equilibrium point directly with root-finding and computes gradients with implicit differentiation.
We propose a novel probabilistic framework to overcome the technical difficulty in the non-asymptotic analysis of infinite-depth weight-tied models.
arXiv Detail & Related papers (2022-05-27T08:00:13Z)
- Single Trajectory Nonparametric Learning of Nonlinear Dynamics [8.438421942654292]
Given a single trajectory of a dynamical system, we analyze the performance of the nonparametric least squares estimator (LSE).
We leverage recently developed information-theoretic methods to establish the optimality of the LSE for nonparametric hypothesis classes.
We specialize our results to a number of scenarios of practical interest, such as Lipschitz dynamics, generalized linear models, and dynamics described by functions in certain classes of Reproducing Kernel Hilbert Spaces (RKHS).
arXiv Detail & Related papers (2022-02-16T19:38:54Z)
- Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z)
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
- Max-Linear Regression by Convex Programming [5.366354612549172]
We formulate and analyze a scalable convex program given by anchored regression (AR) as the estimator for the max-linear regression problem.
Our result shows that the number of noise-free observations sufficient for exact recovery scales as $k^4 p$ up to a logarithmic factor.
arXiv Detail & Related papers (2021-03-12T00:55:54Z)
- Last iterate convergence of SGD for Least-Squares in the Interpolation regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor perfectly fits inputs and outputs, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z)
- The Generalized Lasso with Nonlinear Observations and Generative Priors [63.541900026673055]
We make the assumption of sub-Gaussian measurements, which is satisfied by a wide range of measurement models.
We show that our result can be extended to the uniform recovery guarantee under the assumption of a so-called local embedding property.
arXiv Detail & Related papers (2020-06-22T16:43:35Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
- A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers [3.167685495996986]
This paper establishes a precise high-dimensional theory for boosting on separable data.
Under a class of statistical models, we provide an exact analysis of the generalization error of boosting.
We also explicitly pin down the relation between the boosting test error and the optimal Bayes error.
arXiv Detail & Related papers (2020-02-05T00:24:53Z)