Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models
- URL: http://arxiv.org/abs/2010.13933v4
- Date: Thu, 24 Feb 2022 16:38:45 GMT
- Title: Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models
- Authors: Jason W. Rocks and Pankaj Mehta
- Abstract summary: The bias-variance trade-off is a central concept in supervised learning.
Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The bias-variance trade-off is a central concept in supervised learning. In
classical statistics, increasing the complexity of a model (e.g., number of
parameters) reduces bias but also increases variance. Until recently, it was
commonly believed that optimal performance is achieved at intermediate model
complexities which strike a balance between bias and variance. Modern Deep
Learning methods flout this dogma, achieving state-of-the-art performance using
"over-parameterized models" where the number of fit parameters is large enough
to perfectly fit the training data. As a result, understanding bias and
variance in over-parameterized models has emerged as a fundamental problem in
machine learning. Here, we use methods from statistical physics to derive
analytic expressions for bias and variance in two minimal models of
over-parameterization (linear regression and two-layer neural networks with
nonlinear data distributions), allowing us to disentangle properties stemming
from the model architecture and random sampling of data. In both models,
increasing the number of fit parameters leads to a phase transition where the
training error goes to zero and the test error diverges as a result of the
variance (while the bias remains finite). Beyond this threshold, the test error
of the two-layer neural network decreases due to a monotonic decrease in
both the bias and variance, in contrast with the classical bias-variance
trade-off. We also show that in contrast with classical intuition,
over-parameterized models can overfit even in the absence of noise and exhibit
bias even if the student and teacher models match. We synthesize these results
to construct a holistic understanding of generalization error and the
bias-variance trade-off in over-parameterized models and relate our results to
random matrix theory.
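For reference, the classical decomposition the abstract refers to (for squared loss, averaging over training sets $D$ and label noise of variance $\sigma^2$) is the standard one:

```latex
\mathbb{E}_{D,\varepsilon}\!\left[(y - \hat f_D(x))^2\right]
  = \underbrace{\left(f(x) - \mathbb{E}_D[\hat f_D(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\left(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\right)^2\right]}_{\text{variance}}
  + \sigma^2 .
```

The phase transition described in the abstract is easy to reproduce numerically. Below is a minimal sketch, not the paper's analytic calculation: a student with random nonlinear features fit by minimum-norm least squares; the tanh nonlinearity, noise level, and all sizes are illustrative choices. Training error hits zero at the interpolation threshold p = n, where the test error spikes, and the test error then falls again as p grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 2000, 50

# Teacher: noisy linear target; the student fits random tanh features,
# a deliberately mismatched two-layer-style model.
w_star = rng.normal(size=d)
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_star + 0.5 * rng.normal(size=n_test)

for p in [10, 50, 90, 100, 110, 200, 1000]:   # sweep number of fit parameters
    F = rng.normal(size=(d, p)) / np.sqrt(d)  # frozen random first layer
    Phi_tr, Phi_te = np.tanh(X_tr @ F), np.tanh(X_te @ F)
    # lstsq returns the minimum-norm solution, which interpolates once p >= n.
    w = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]
    print(f"p={p:5d}  train={np.mean((Phi_tr @ w - y_tr)**2):9.5f}"
          f"  test={np.mean((Phi_te @ w - y_te)**2):9.3f}")
```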
Related papers
- Revisiting Optimism and Model Complexity in the Wake of Overparameterized Machine Learning [6.278498348219108]
We revisit model complexity from first principles, by first reinterpreting and then extending the classical statistical concept of (effective) degrees of freedom.
We demonstrate the utility of our proposed complexity measures through a mix of conceptual arguments, theory, and experiments.
arXiv Detail & Related papers (2024-10-02T06:09:57Z)
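The classical quantity the paper above reinterprets and extends is the (effective) degrees of freedom. As a reference point, here is a minimal sketch of the textbook ridge-regression version; the paper's extended complexity measures are not reproduced here.

```python
import numpy as np

def ridge_degrees_of_freedom(X: np.ndarray, lam: float) -> float:
    """Textbook effective degrees of freedom of ridge regression:
    df(lam) = trace(X (X^T X + lam I)^{-1} X^T) = sum_i s_i^2 / (s_i^2 + lam),
    where s_i are the singular values of the design matrix X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

X = np.random.default_rng(1).normal(size=(200, 50))
for lam in [0.0, 1.0, 10.0, 100.0]:   # lam = 0 recovers rank(X) = 50
    print(f"lambda={lam:6.1f}  df={ridge_degrees_of_freedom(X, lam):6.2f}")
```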
- Aliasing and Label-Independent Decomposition of Risk: Beyond the bias-variance trade-off [0.0]
A central problem in data science is to use potentially noisy samples to predict function values for unseen inputs.
We introduce an alternative paradigm called the generalized aliasing decomposition (GAD).
GAD can be explicitly calculated from the relationship between model class and samples without seeing any data labels.
arXiv Detail & Related papers (2024-08-15T17:49:24Z)
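One label-independent object in this spirit, shown here for plain least squares and not necessarily the paper's exact GAD construction, is the linear operator that maps training labels to test predictions; it is fixed by the model class and sample locations alone.

```python
import numpy as np

rng = np.random.default_rng(9)
n_train, n_test, p = 40, 60, 80
Phi_tr = rng.normal(size=(n_train, p))   # model class evaluated at train inputs
Phi_te = rng.normal(size=(n_test, p))    # ... and at test inputs

# For (minimum-norm) least squares, y_hat_test = T @ y_train with
# T = Phi_test @ pinv(Phi_train): computable before any labels are seen.
T = Phi_te @ np.linalg.pinv(Phi_tr)
sing = np.linalg.svd(T, compute_uv=False)
print("largest label-to-prediction amplification:", sing[0])
```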
- Scaling and renormalization in high-dimensional regression [72.59731158970894]
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models.
We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning.
arXiv Detail & Related papers (2024-05-01T15:59:00Z)
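The paper above derives such curves analytically. As a purely empirical probe of the same high-dimensional ridge setting, with all sizes and the noise level below being illustrative choices, one can sweep the ridge penalty and watch training and test error trade off:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_test, d = 100, 2000, 300             # high-dimensional regime: d > n
beta = rng.normal(size=d) / np.sqrt(d)
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y = X @ beta + 0.3 * rng.normal(size=n)
yt = Xt @ beta + 0.3 * rng.normal(size=n_test)

for lam in [1e-6, 1e-2, 1.0, 10.0, 100.0]:
    # Ridge estimator: w = (X^T X + lam I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(f"lam={lam:8.2g}  train={np.mean((X @ w - y)**2):7.4f}"
          f"  test={np.mean((Xt @ w - yt)**2):7.4f}")
```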
- On the Strong Correlation Between Model Invariance and Generalization [54.812786542023325]
Generalization captures a model's ability to classify unseen data.
Invariance measures consistency of model predictions on transformations of the data.
From a dataset-centric view, we find that a given model's accuracy and invariance are linearly correlated across different test sets.
arXiv Detail & Related papers (2022-07-14T17:08:25Z)
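One simple way to operationalize such an invariance measure (the paper's exact definition may differ, and the model and transformations below are toy stand-ins) is the fraction of inputs whose predicted label survives every transformation:

```python
import numpy as np

def invariance_score(predict, X, transforms) -> float:
    """Fraction of inputs whose predicted label is unchanged by every
    transformation -- a simple consistency-based invariance measure."""
    base = predict(X)
    consistent = np.ones(len(X), dtype=bool)
    for t in transforms:
        consistent &= (predict(t(X)) == base)
    return float(consistent.mean())

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
predict = lambda A: (A.sum(axis=1) > 0).astype(int)   # toy classifier
transforms = [lambda A: 1.1 * A,                      # rescaling
              lambda A: A + 0.01 * rng.normal(size=A.shape)]  # small noise
print("invariance:", invariance_score(predict, X, transforms))
```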
"Over parameterized models" avoid overfitting even when the number of fit parameters is large enough to perfectly fit the training data.
We show how each transition arises due to small nonzero eigenvalues in the Hessian matrix.
We compare and contrast the phase diagram of the random linear features model to the random nonlinear features model and ordinary regression.
arXiv Detail & Related papers (2022-03-10T16:09:21Z)
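The mechanism named in the summary above can be seen directly in a toy check: for squared loss the Hessian in the fit weights is Phi^T Phi / n, and its smallest nonzero eigenvalue collapses as the number of random linear features p approaches the number of samples n. All sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 200
X = rng.normal(size=(n, d))

for p in [20, 80, 100, 120, 300]:
    F = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi = X @ F                     # random *linear* features
    H = Phi.T @ Phi / n             # Hessian of the squared loss in the weights
    eig = np.linalg.eigvalsh(H)
    nonzero = eig[eig > 1e-8]
    print(f"p={p:4d}  smallest nonzero eigenvalue = {nonzero.min():.3e}")
```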
- Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z)
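The paper's OV metric has a specific definition; as a rough, hedged stand-in for "diversity of model updates", one can record a model's predictions on fixed probe points across SGD iterates and average their variance. The toy logistic-regression setup below is entirely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)
X_probe = rng.normal(size=(100, d))        # fixed held-out probe points

sigmoid = lambda z: 1 / (1 + np.exp(-z))
w = np.zeros(d)
probe_preds = []
for step in range(500):                    # plain SGD on the logistic loss
    i = rng.integers(n)
    w -= 0.1 * (sigmoid(X[i] @ w) - y[i]) * X[i]
    probe_preds.append(sigmoid(X_probe @ w))

# Variance of predictions across iterates, averaged over probe points:
# an OV-style proxy for how much successive updates disagree.
probe_preds = np.array(probe_preds)
print("optimization-variance proxy:", probe_preds.var(axis=0).mean())
```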
- Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: "scale" metrics perform well overall but poorly on sub-partitions of the data.
We present two novel shape metrics, one data-independent and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
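The Simpson's paradox in question is easy to illustrate with synthetic numbers: a metric can anti-correlate with accuracy inside each sub-partition while correlating positively once the groups are pooled. The values below are made up purely for illustration.

```python
import numpy as np

# Within each group the metric *decreases* as accuracy increases ...
metric_a = np.array([1.0, 2.0, 3.0]); acc_a = np.array([0.90, 0.88, 0.86])
metric_b = np.array([6.0, 7.0, 8.0]); acc_b = np.array([0.99, 0.97, 0.95])

corr = lambda u, v: np.corrcoef(u, v)[0, 1]
print("group A:", corr(metric_a, acc_a))                  # -1.0
print("group B:", corr(metric_b, acc_b))                  # -1.0
# ... yet pooled together the trend reverses.
print("pooled: ", corr(np.concatenate([metric_a, metric_b]),
                       np.concatenate([acc_a, acc_b])))   # > 0
```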
- Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition [34.235007566913396]
We describe an interpretable, symmetric decomposition of the variance into terms associated with the labels.
We find that the bias decreases monotonically with the network width, but the variance terms exhibit non-monotonic behavior.
We also analyze the strikingly rich phenomenology that this decomposition reveals.
arXiv Detail & Related papers (2020-11-04T21:04:02Z)
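The coarse, classical version of such a decomposition can be estimated empirically by retraining a fixed model class on many independent training sets. Below is a minimal sketch with a toy polynomial regressor; the paper's fine-grained, label-wise refinement of the variance is not attempted here.

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(3 * x)                   # ground-truth function
x_test = np.linspace(-1, 1, 200)

preds = []
for trial in range(500):                      # many independent training sets
    x = rng.uniform(-1, 1, 30)
    y = f(x) + 0.3 * rng.normal(size=30)
    coef = np.polyfit(x, y, deg=5)            # fixed model class
    preds.append(np.polyval(coef, x_test))

preds = np.array(preds)
bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 = {bias2:.4f}   variance = {variance:.4f}")
```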
- What causes the test error? Going beyond bias-variance via ANOVA [21.359033212191218]
Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level.
Recent work aimed to understand in greater depth why overparametrization is helpful for generalization.
We propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way.
arXiv Detail & Related papers (2020-10-11T05:21:13Z)
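The idea can be sketched with a balanced two-factor design: vary the training-set draw and the initialization independently, then split the variance of the test error into two main effects plus an interaction remainder. The error function below is a synthetic stand-in, not a trained network.

```python
import numpy as np

def test_error(data_seed: int, init_seed: int) -> float:
    """Synthetic stand-in for a trained model's test error, with
    contributions from the data draw, the init draw, and their interaction."""
    d = np.random.default_rng(data_seed).normal()
    i = np.random.default_rng(10_000 + init_seed).normal()
    return 1.0 + 0.3 * d + 0.2 * i + 0.1 * d * i

E = np.array([[test_error(a, b) for b in range(50)] for a in range(50)])

var_data = E.mean(axis=1).var()               # main effect: data sampling
var_init = E.mean(axis=0).var()               # main effect: initialization
var_inter = E.var() - var_data - var_init     # interaction (remainder)
print(f"data: {var_data:.4f}  init: {var_init:.4f}  interaction: {var_inter:.4f}")
```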
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
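In the linear interpolating regime such a distribution is easy to sample: every weight vector of the form w_min plus a null-space perturbation fits the training data exactly, so one can draw many interpolators and compare their test errors. The perturbation scale and sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, n_test = 50, 100, 2000                 # over-parameterized: p > n
w_star = rng.normal(size=p) / np.sqrt(p)
X, Xt = rng.normal(size=(n, p)), rng.normal(size=(n_test, p))
y, yt = X @ w_star, Xt @ w_star

w_min = np.linalg.lstsq(X, y, rcond=None)[0] # minimum-norm interpolator
_, _, Vt = np.linalg.svd(X)                  # rows n: of Vt span the null space
null_basis = Vt[n:]

errors = []
for _ in range(2000):                        # every sampled w interpolates exactly
    w = w_min + 0.1 * null_basis.T @ rng.normal(size=p - n)
    errors.append(np.mean((Xt @ w - yt) ** 2))
errors = np.array(errors)
print(f"typical (median) test error: {np.median(errors):.4f}")
print(f"worst sampled test error:    {errors.max():.4f}")
```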
- An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause over-parameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)