Deep Linear Networks can Benignly Overfit when Shallow Ones Do
- URL: http://arxiv.org/abs/2209.09315v1
- Date: Mon, 19 Sep 2022 19:23:04 GMT
- Title: Deep Linear Networks can Benignly Overfit when Shallow Ones Do
- Authors: Niladri S. Chatterji, Philip M. Long
- Abstract summary: We show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant.
Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum $\ell_2$-norm solution.
- Score: 16.1176305285103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We bound the excess risk of interpolating deep linear networks trained using
gradient flow. In a setting previously used to establish risk bounds for the
minimum $\ell_2$-norm interpolant, we show that randomly initialized deep
linear networks can closely approximate or even match known bounds for the
minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating
deep linear models have exactly the same conditional variance as the minimum
$\ell_2$-norm solution. Since the noise affects the excess risk only through
the conditional variance, this implies that depth does not improve the
algorithm's ability to "hide the noise". Our simulations verify that aspects of
our bounds reflect typical behavior for simple data distributions. We also find
that similar phenomena are seen in simulations with ReLU networks, although the
situation there is more nuanced.
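As a quick illustration of the comparison described in the abstract (not the authors' construction; the data distribution, depth, width, initialization scale, and step size below are all illustrative assumptions), the following sketch trains a small deep linear network by plain gradient descent, as a discrete stand-in for gradient flow, on overparameterized Gaussian data and reports how far its end-to-end linear map lands from the minimum $\ell_2$-norm interpolant.

```python
# Illustrative sketch: deep linear network trained by gradient descent from a
# small random initialization, compared with the minimum l2-norm interpolant.
from functools import reduce
import numpy as np

rng = np.random.default_rng(0)
n, d, width, depth = 20, 50, 50, 3           # n < d: interpolation regime
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=(d, 1)) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.normal(size=(n, 1))

# Minimum l2-norm solution of X @ theta = y.
theta_min = np.linalg.pinv(X) @ y            # (d, 1)

# Deep linear network f(x) = x W_1 W_2 ... W_L, small random initialization.
dims = [d] + [width] * (depth - 1) + [1]
Ws = [0.25 * rng.normal(size=(dims[i], dims[i + 1])) / np.sqrt(dims[i])
      for i in range(depth)]

lr, steps = 2e-3, 50_000                     # assumed values, not from the paper
for _ in range(steps):
    resid = X @ reduce(np.matmul, Ws) - y    # (n, 1) residual of the squared loss
    updated = []
    for k in range(depth):                   # gradient of 0.5*||X W_1...W_L - y||^2 in W_k
        A = X if k == 0 else X @ reduce(np.matmul, Ws[:k])
        B = np.eye(1) if k == depth - 1 else reduce(np.matmul, Ws[k + 1:])
        updated.append(Ws[k] - lr * (A.T @ resid @ B.T))
    Ws = updated

theta_deep = reduce(np.matmul, Ws)           # end-to-end linear map, shape (d, 1)
print("training residual norm:", float(np.linalg.norm(X @ theta_deep - y)))
print("relative distance to min-norm interpolant:",
      float(np.linalg.norm(theta_deep - theta_min) / np.linalg.norm(theta_min)))
```

Under these assumed settings the trained end-to-end map typically ends up close to the minimum $\ell_2$-norm interpolant, which is the kind of behavior the abstract's bounds and simulations address.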
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results improve upon the shortcomings of the well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Sample Complexity of Neural Policy Mirror Descent for Policy
Optimization on Low-Dimensional Manifolds [75.51968172401394]
We study the sample complexity of the neural policy mirror descent (NPMD) algorithm with deep convolutional neural networks (CNN)
In each iteration of NPMD, both the value function and the policy can be well approximated by CNNs.
We show that NPMD can leverage the low-dimensional structure of state space to escape from the curse of dimensionality.
arXiv Detail & Related papers (2023-09-25T07:31:22Z) - Noisy Interpolation Learning with Shallow Univariate ReLU Networks [33.900009202637285]
Mallinar et al. (2022) noted that neural networks often seem to exhibit "tempered overfitting", wherein the population risk does not converge to the Bayes-optimal error.
We provide the first rigorous analysis of the overfitting behavior of regression with minimum-norm weights in this setting.
arXiv Detail & Related papers (2023-07-28T08:41:12Z) - Implicit Regularization Leads to Benign Overfitting for Sparse Linear
Regression [16.551664358490658]
In deep learning, often the training process finds an interpolator (a solution with 0 training loss) but the test loss is still low.
One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator.
We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss.
arXiv Detail & Related papers (2023-02-01T05:41:41Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic
Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - Minimum $\ell_{1}$-norm interpolators: Precise asymptotics and multiple
descent [19.781475554462553]
This paper pursues a theoretical understanding of an important type of interpolator: the minimum $\ell_1$-norm interpolator (a small computational sketch of this object appears after this list).
We observe, and provide rigorous theoretical justification for, a curious multi-descent phenomenon.
Our finding is built upon an exact characterization of the risk behavior, which is governed by a system of two non-linear equations with two unknowns.
arXiv Detail & Related papers (2021-10-18T17:51:14Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - The Generalized Lasso with Nonlinear Observations and Generative Priors [63.541900026673055]
We make the assumption of sub-Gaussian measurements, which is satisfied by a wide range of measurement models.
We show that our result can be extended to the uniform recovery guarantee under the assumption of a so-called local embedding property.
arXiv Detail & Related papers (2020-06-22T16:43:35Z) - Revealing the Structure of Deep Neural Networks via Convex Duality [70.15611146583068]
We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of hidden layers.
We show that a set of optimal hidden layer weights for a norm regularized training problem can be explicitly found as the extreme points of a convex set.
We apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds.
arXiv Detail & Related papers (2020-02-22T21:13:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.