On the Universality of the Double Descent Peak in Ridgeless Regression
- URL: http://arxiv.org/abs/2010.01851v8
- Date: Tue, 1 Aug 2023 12:36:42 GMT
- Title: On the Universality of the Double Descent Peak in Ridgeless Regression
- Authors: David Holzmüller
- Abstract summary: We prove a non-asymptotic distribution-independent lower bound for the expected mean squared error caused by label noise in ridgeless linear regression.
Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove a non-asymptotic distribution-independent lower bound for the
expected mean squared generalization error caused by label noise in ridgeless
linear regression. Our lower bound generalizes a similar known result to the
overparameterized (interpolating) regime. In contrast to most previous works,
our analysis applies to a broad class of input distributions with almost surely
full-rank feature matrices, which allows us to cover various types of
deterministic or random feature maps. Our lower bound is asymptotically sharp
and implies that in the presence of label noise, ridgeless linear regression
does not perform well around the interpolation threshold for any of these
feature maps. We analyze the imposed assumptions in detail and provide a theory
for analytic (random) feature maps. Using this theory, we can show that our
assumptions are satisfied for input distributions with a (Lebesgue) density and
feature maps given by random deep neural networks with analytic activation
functions like sigmoid, tanh, softplus or GELU. As further examples, we show
that feature maps from random Fourier features and polynomial kernels also
satisfy our assumptions. We complement our theory with further experimental and
analytic results.
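As a concrete illustration of the phenomenon described above (not a construction from the paper), the following minimal sketch fits ridgeless, i.e. minimum-norm, linear regression on random Fourier features with noisy labels and reports the test error for feature counts around the interpolation threshold p ≈ n. The target function, noise level, and feature counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(x, W, b):
    """Random Fourier features cos(Wx + b) for 1-D inputs x."""
    return np.cos(np.outer(x, W) + b)

n_train, n_test, noise_std = 40, 1000, 0.5
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)
f = lambda x: np.sin(2 * np.pi * x)              # arbitrary target function
y_train = f(x_train) + noise_std * rng.normal(size=n_train)

for p in [5, 10, 20, 40, 60, 120, 400]:          # number of random features
    W = rng.normal(scale=4.0, size=p)
    b = rng.uniform(0, 2 * np.pi, p)
    Phi_train, Phi_test = rff(x_train, W, b), rff(x_test, W, b)
    # Ridgeless fit: minimum-norm least-squares solution via the pseudoinverse.
    theta = np.linalg.pinv(Phi_train) @ y_train
    mse = np.mean((Phi_test @ theta - f(x_test)) ** 2)
    print(f"p = {p:4d}  test MSE = {mse:.3f}")
```

Under these illustrative choices, the printed test MSE typically spikes near p = 40 (the number of training points) and decreases again in the heavily overparameterized regime, which is the double descent peak the lower bound addresses.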
Related papers
- Dimension-free deterministic equivalents and scaling laws for random feature regression [11.607594737176973]
We show that the test error is well approximated by a closed-form expression that only depends on the feature map eigenvalues.
Notably, our approximation guarantee is non-asymptotic, multiplicative, and independent of the feature map dimension.
arXiv Detail & Related papers (2024-05-24T16:43:26Z)
- Learning Linear Causal Representations from Interventions under General Nonlinear Mixing [52.66151568785088]
We prove strong identifiability results given unknown single-node interventions without access to the intervention targets.
This is the first instance of causal identifiability from non-paired interventions for deep neural network embeddings.
arXiv Detail & Related papers (2023-06-04T02:32:12Z)
- Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z)
- Ridgeless Regression with Random Features [23.41536146432726]
We investigate the statistical properties of ridgeless regression with random features and gradient descent (see the sketch below).
We propose a tunable kernel algorithm that optimizes the spectral density of the kernel during training.
arXiv Detail & Related papers (2022-05-01T14:25:08Z)
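The entry above concerns ridgeless regression with random features trained by gradient descent. As a hedged, minimal sketch (not the paper's tunable-kernel algorithm), the code below runs plain full-batch gradient descent from zero initialization on an overparameterized random-feature model; in this setting the iterates stay in the row space of the feature matrix and approach the minimum-norm interpolator. All parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 200                                  # overparameterized: more features than samples
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.1 * rng.normal(size=n)    # noisy labels for a simple target

W = rng.normal(scale=6.0, size=p)
b = rng.uniform(0, 2 * np.pi, p)
Phi = np.cos(np.outer(x, W) + b) / np.sqrt(p)   # random feature matrix, shape (n, p)

# Plain full-batch gradient descent on the squared loss, started at zero.
theta = np.zeros(p)
lr = 0.5
for _ in range(50_000):
    theta -= lr * Phi.T @ (Phi @ theta - y) / n

theta_min_norm = np.linalg.pinv(Phi) @ y        # ridgeless (minimum-norm) solution
print("training residual:        ", np.linalg.norm(Phi @ theta - y))
print("distance to min-norm fit: ", np.linalg.norm(theta - theta_min_norm))
```

Both printed quantities should shrink as the number of gradient steps grows, illustrating that gradient descent from zero effectively performs ridgeless (minimum-norm) regression in this setting; convergence along poorly conditioned feature directions is slow.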
- Fluctuations, Bias, Variance & Ensemble of Learners: Exact Asymptotics for Convex Losses in High-Dimension [25.711297863946193]
We develop a theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features.
We provide a complete description of the joint distribution of the empirical risk minimiser for generic convex loss and regularisation in the high-dimensional limit.
arXiv Detail & Related papers (2022-01-31T17:44:58Z)
- Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm (see the sketch below).
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z)
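As a simplified, deterministic sketch of the scaled-gradient idea behind the SSD entry above (not the stochastic algorithm from the paper, which would replace the exact matrices with streaming estimates), the code below runs a B-preconditioned gradient ascent on the generalized Rayleigh quotient to find the top generalized eigenvector of A v = λ B v, the core computation behind correlation analysis. The matrices, step size, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
# Random symmetric A and positive-definite B define the problem A v = lam * B v.
M = rng.normal(size=(d, d)); A = (M + M.T) / 2
N = rng.normal(size=(d, d)); B = N @ N.T + d * np.eye(d)

v = rng.normal(size=d)
eta = 0.1
for _ in range(2000):
    rayleigh = (v @ A @ v) / (v @ B @ v)
    # B-scaled gradient ascent step on the generalized Rayleigh quotient.
    v = v + eta * (np.linalg.solve(B, A @ v) - rayleigh * v)
    v = v / np.sqrt(v @ B @ v)                  # keep v on the ellipsoid v^T B v = 1

lam_max = np.max(np.linalg.eigvals(np.linalg.solve(B, A)).real)
print("Rayleigh quotient:", (v @ A @ v) / (v @ B @ v), " top eigenvalue:", lam_max)
```

The final Rayleigh quotient should approach the largest generalized eigenvalue; substituting per-sample estimates of A and B with a decaying step size gives the stochastic, one-time-scale flavor the paper studies.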
- Harmless interpolation in regression and classification with structured features [21.064512161584872]
Overparametrized neural networks tend to perfectly fit noisy training data yet generalize well on test data.
We present a general and flexible framework for upper bounding regression and classification risk in a reproducing kernel Hilbert space.
arXiv Detail & Related papers (2021-11-09T15:12:26Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study the properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Predicting Unreliable Predictions by Shattering a Neural Network [145.3823991041987]
Piecewise linear neural networks can be split into subfunctions.
Subfunctions have their own activation pattern, domain, and empirical error.
Empirical error for the full network can be written as an expectation over subfunctions (see the sketch below).
arXiv Detail & Related papers (2021-06-15T18:34:41Z)
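The following minimal sketch (not the paper's method) illustrates the decomposition quoted in the entry above: a one-hidden-layer ReLU network is piecewise linear, each hidden-unit activation pattern indexes one linear subfunction, and the network's empirical error equals the pattern-frequency-weighted average of per-pattern errors. The network here is random and untrained, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, h = 200, 2, 8                        # samples, input dim, hidden units
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] * X[:, 1])             # synthetic labels

# A random one-hidden-layer ReLU network (untrained, for illustration only).
W1, b1 = rng.normal(size=(d, h)), rng.normal(size=h)
w2, b2 = rng.normal(size=h), rng.normal()

pre = X @ W1 + b1
pattern = pre > 0                          # hidden activation pattern per sample
out = np.maximum(pre, 0) @ w2 + b2
err = (np.sign(out) != y).astype(float)    # 0/1 loss per sample

# Group samples by activation pattern: each pattern indexes one linear subfunction.
per_pattern = {}
for key, e in zip((tuple(row) for row in pattern), err):
    per_pattern.setdefault(key, []).append(e)

# Full empirical error = sum over patterns of (pattern frequency) * (error within pattern).
total = sum(len(v) / n * np.mean(v) for v in per_pattern.values())
print("direct empirical error:   ", err.mean())
print("decomposed over patterns: ", total)
print("number of occupied subfunctions:", len(per_pattern))
```

The two printed error values agree exactly, since grouping the per-sample losses by activation pattern and reweighting by pattern frequency is simply a reordering of the empirical average.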
- Adversarial Estimation of Riesz Representers [21.510036777607397]
We propose an adversarial framework to estimate the Riesz representer using general function spaces.
We prove a nonasymptotic mean square rate in terms of an abstract quantity called the critical radius, then specialize it for neural networks, random forests, and reproducing kernel Hilbert spaces as leading cases.
arXiv Detail & Related papers (2020-12-30T19:46:57Z)
- Bayesian Deep Learning and a Probabilistic Perspective of Generalization [56.69671152009899]
We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization.
We also propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction (see the sketch below).
arXiv Detail & Related papers (2020-02-20T15:13:27Z)
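As a minimal sketch of the deep-ensembles idea summarized in the entry above (ensembling as a crude approximation to Bayesian marginalization over modes), the code below trains several small networks that differ only in their random initialization and averages their predictive distributions. scikit-learn's MLPClassifier, the dataset, and the ensemble size are arbitrary illustrative choices, not the paper's setup.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Train several networks that differ only in their random initialization.
members = [
    MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=s).fit(X_tr, y_tr)
    for s in range(5)
]

# Ensemble prediction: average the members' predictive distributions.
probs = np.mean([m.predict_proba(X_te) for m in members], axis=0)
ens_acc = np.mean(probs.argmax(axis=1) == y_te)
ind_acc = np.mean([m.score(X_te, y_te) for m in members])

print(f"mean single-model accuracy: {ind_acc:.3f}")
print(f"ensemble accuracy:          {ens_acc:.3f}")
```

Averaging the members' predictive probabilities typically matches or slightly improves on the mean single-model accuracy; this averaging over independently found modes is the approximate marginalization the summary refers to.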