Asymptotic Risk of Overparameterized Likelihood Models: Double Descent
Theory for Deep Neural Networks
- URL: http://arxiv.org/abs/2103.00500v1
- Date: Sun, 28 Feb 2021 13:02:08 GMT
- Title: Asymptotic Risk of Overparameterized Likelihood Models: Double Descent
Theory for Deep Neural Networks
- Authors: Ryumei Nakada, Masaaki Imaizumi
- Abstract summary: We investigate the risk of a general class of overparameterized likelihood models, including deep models.
We demonstrate that several explicit models, such as parallel deep neural networks and ensemble learning, are in agreement with our theory.
- Score: 12.132641563193582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the asymptotic risk of a general class of overparameterized
likelihood models, including deep models. The recent empirical success of
large-scale models has motivated several theoretical studies to investigate a
scenario wherein both the number of samples, $n$, and parameters, $p$, diverge
to infinity and derive an asymptotic risk at the limit. However, these theorems
are only valid for linear-in-feature models, such as generalized linear
regression, kernel regression, and shallow neural networks. Hence, it is
difficult to investigate a wider class of nonlinear models, including deep
neural networks with three or more layers. In this study, we consider a
likelihood maximization problem without the model constraints and analyze the
upper bound of an asymptotic risk of an estimator with penalization.
Technically, we combine a property of the Fisher information matrix with an
extended Marchenko-Pastur law and associate the combination with empirical
process techniques. The derived bound is general, as it describes both the
double descent and the regularized risk curves, depending on the penalization.
Our results are valid without the linear-in-feature constraints on models and
allow us to derive the general spectral distributions of a Fisher information
matrix from the likelihood. We demonstrate that several explicit models, such
as parallel deep neural networks and ensemble learning, are in agreement with
our theory. This result indicates that even large and deep models have a small
asymptotic risk if they exhibit a specific structure, such as divisibility. To
verify this finding, we conduct a real-data experiment with parallel deep
neural networks. Our results expand the applicability of the asymptotic risk
analysis, and may also contribute to the understanding and application of deep
learning.
Related papers
- Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - The Surprising Harmfulness of Benign Overfitting for Adversarial
Robustness [13.120373493503772]
We prove a surprising result: even if the ground truth itself is robust to adversarial examples, and the benignly overfitted model is benign in terms of the "standard" out-of-sample risk objective, this benign overfitting can be harmful under adversarial perturbations.
Our finding provides theoretical insights into the puzzling phenomenon observed in practice, where the true target function (e.g., human) is robust against adversarial attack, while benignly overfitted neural networks lead to models that are not robust.
arXiv Detail & Related papers (2024-01-19T15:40:46Z) - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer
Linear Networks [51.1848572349154]
Neural network models that perfectly fit noisy data can generalize well to unseen test data.
We consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk.
arXiv Detail & Related papers (2021-08-25T22:01:01Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
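A minimal sketch of the kind of network this result concerns: an untrained ReLU layer with standard Gaussian weights and uniformly distributed biases, followed by a fitted linear readout. The data geometry (two clusters with means at ±3), the width, and the least-squares readout here are arbitrary toy assumptions, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two widely separated Gaussian clusters (toy data; dimensions arbitrary).
d, n_per = 5, 200
X = np.vstack([rng.normal(-3.0, 1.0, size=(n_per, d)),
               rng.normal(+3.0, 1.0, size=(n_per, d))])
y = np.concatenate([-np.ones(n_per), np.ones(n_per)])

# Random, untrained first layer: Gaussian weights, uniform biases.
width = 500
W = rng.normal(size=(d, width))
b = rng.uniform(-1.0, 1.0, size=width)
H = np.maximum(X @ W + b, 0.0)          # ReLU feature map

# Only the linear readout is fitted (least squares).
v, *_ = np.linalg.lstsq(H, y, rcond=None)
accuracy = float(np.mean(np.sign(H @ v) == y))
```

With well-separated clusters and enough random features, the readout separates the two classes perfectly, consistent with the high-probability separation guarantee the summary describes.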
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Non-asymptotic Excess Risk Bounds for Classification with Deep
Convolutional Neural Networks [6.051520664893158]
We consider the problem of binary classification with a class of general deep convolutional neural networks.
We define the prefactors of the risk bounds in terms of the input data dimension and other model parameters.
We show that the classification methods with CNNs can circumvent the curse of dimensionality.
arXiv Detail & Related papers (2021-05-01T15:55:04Z) - Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve
Optimism, Embrace Virtual Curvature [61.22680308681648]
We show that global convergence is statistically intractable even for one-layer neural net bandit with a deterministic reward.
For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL).
arXiv Detail & Related papers (2021-02-08T12:41:56Z) - The Neural Tangent Kernel in High Dimensions: Triple Descent and a
Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is the double descent curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z) - Measuring Model Complexity of Neural Networks with Curve Activation
Functions [100.98319505253797]
We propose the linear approximation neural network (LANN) to approximate a given deep model with curve activation function.
We experimentally explore the training process of neural networks and detect overfitting.
We find that the $L_1$ and $L_2$ regularizations suppress the increase of model complexity.
arXiv Detail & Related papers (2020-06-16T07:38:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.