Implicit Regularization in Tensor Factorization
- URL: http://arxiv.org/abs/2102.09972v1
- Date: Fri, 19 Feb 2021 15:10:26 GMT
- Title: Implicit Regularization in Tensor Factorization
- Authors: Noam Razin, Asaf Maman, Nadav Cohen
- Abstract summary: Implicit regularization in deep learning is perceived as a tendency of gradient-based optimization to fit training data with predictors of minimal "complexity."
We argue that tensor rank may pave the way to explaining both the implicit regularization of neural networks and the properties of real-world data translating it into generalization.
- Score: 17.424619189180675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Implicit regularization in deep learning is perceived as a tendency of
gradient-based optimization to fit training data with predictors of minimal
"complexity." The fact that only some types of data give rise to generalization
is understood to result from them being especially amenable to fitting with low
complexity predictors. A major challenge towards formalizing this intuition is
to define complexity measures that are quantitative yet capture the essence of
data that admits generalization. With an eye towards this challenge, we analyze
the implicit regularization in tensor factorization, equivalent to a certain
non-linear neural network. We characterize the dynamics that gradient descent
induces on the factorization, and establish a bias towards low tensor rank, in
compliance with existing empirical evidence. Then, motivated by tensor rank
capturing implicit regularization of a non-linear neural network, we
empirically explore it as a measure of complexity, and find that it stays
extremely low when fitting standard datasets. This leads us to believe that
tensor rank may pave the way to explaining both the implicit regularization of
neural networks and the properties of real-world data translating it into
generalization.
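The abstract's setup can be sketched concretely: running gradient descent on an overparameterized CP (tensor) factorization of a low-rank target tends to grow only a few rank-1 components, i.e., the learned tensor has low effective tensor rank. Below is a minimal illustrative sketch, not the paper's actual experiments; the dimensions, initialization scale, learning rate, and seed are all assumptions chosen for a quick demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d, R = 5, 8  # tensor dimension and number of CP components (R exceeds the true rank)

# Rank-1 target tensor u (x) u (x) u with unit Frobenius norm
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
target = np.einsum('i,j,k->ijk', u, u, u)

# Small (near-zero) initialization, as in analyses of implicit regularization
A = 0.1 * rng.standard_normal((d, R))
B = 0.1 * rng.standard_normal((d, R))
C = 0.1 * rng.standard_normal((d, R))

lr = 0.2
for _ in range(5000):
    W = np.einsum('ir,jr,kr->ijk', A, B, C)  # current CP reconstruction
    E = W - target                           # gradient of 0.5 * ||W - target||^2 w.r.t. W
    gA = np.einsum('ijk,jr,kr->ir', E, B, C)
    gB = np.einsum('ijk,ir,kr->jr', E, A, C)
    gC = np.einsum('ijk,ir,jr->kr', E, A, B)
    A, B, C = A - lr * gA, B - lr * gB, C - lr * gC

loss = 0.5 * np.sum((np.einsum('ir,jr,kr->ijk', A, B, C) - target) ** 2)

# Size of each rank-1 component; with small init, typically only a few dominate
sizes = np.sort([np.linalg.norm(A[:, r]) * np.linalg.norm(B[:, r]) * np.linalg.norm(C[:, r])
                 for r in range(R)])[::-1]
print(loss, sizes / sizes[0])
```

Despite having 8 components available, the fitted factorization is typically dominated by far fewer, which is the low-tensor-rank bias the paper characterizes.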
Related papers
- Implicit Regularization for Tubal Tensor Factorizations via Gradient Descent [4.031100721019478]
We provide a rigorous analysis of implicit regularization in an overparametrized tensor factorization problem beyond the lazy training regime.
We prove the first tensor result of its kind for gradient descent rather than gradient flow.
arXiv Detail & Related papers (2024-10-21T17:52:01Z)
- On the Geometry of Regularization in Adversarial Training: High-Dimensional Asymptotics and Generalization Bounds [11.30047438005394]
This work investigates the question of how to choose the regularization norm $\lVert \cdot \rVert$ in the context of high-dimensional adversarial training for binary classification.
We quantitatively characterize the relationship between perturbation size and the optimal choice of $\lVert \cdot \rVert$, confirming the intuition that, in the data-scarce regime, the type of regularization becomes increasingly important for adversarial training as perturbations grow in size.
arXiv Detail & Related papers (2024-10-21T14:53:12Z)
- Gradient-Based Feature Learning under Structured Data [57.76552698981579]
In the anisotropic setting, the commonly used spherical gradient dynamics may fail to recover the true direction.
We show that appropriate weight normalization that is reminiscent of batch normalization can alleviate this issue.
In particular, under the spiked model with a suitably large spike, the sample complexity of gradient-based training can be made independent of the information exponent.
arXiv Detail & Related papers (2023-09-07T16:55:50Z)
- The Inductive Bias of Flatness Regularization for Deep Matrix Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
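For reference, the Schatten 1-norm mentioned here (also called the nuclear norm) is the sum of a matrix's singular values, and minimizing it is a standard convex surrogate for low rank. A quick sketch with an illustrative matrix:

```python
import numpy as np

# Schatten 1-norm (nuclear norm): the sum of the singular values.
# For a diagonal matrix, these are just the absolute diagonal entries.
M = np.diag([3.0, 2.0, 0.0])
schatten_1 = np.linalg.svd(M, compute_uv=False).sum()
print(schatten_1)  # 5.0 for this diagonal example
```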
arXiv Detail & Related papers (2023-06-22T23:14:57Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- Instance-Dependent Generalization Bounds via Optimal Transport [51.71650746285469]
Existing generalization bounds fail to explain crucial factors that drive the generalization of modern neural networks.
We derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space.
We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
arXiv Detail & Related papers (2022-11-02T16:39:42Z)
- Implicit Regularization with Polynomial Growth in Deep Tensor Factorization [4.30484058393522]
We study the implicit regularization effects of deep learning in tensor factorization.
We show that its effect in deep tensor factorization grows faithfully with the depth of the network.
arXiv Detail & Related papers (2022-07-18T21:04:37Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization [34.235007566913396]
Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well.
An emerging paradigm for describing this unexpected behavior is in terms of a "double descent" curve.
We provide a precise high-dimensional analysis of generalization with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks with gradient descent.
arXiv Detail & Related papers (2020-08-15T20:55:40Z)
- Understanding Generalization in Deep Learning via Tensor Methods [53.808840694241]
We advance the understanding of the relations between the network's architecture and its generalizability from the compression perspective.
We propose a series of intuitive, data-dependent, and easily measurable properties that tightly characterize the compressibility and generalizability of neural networks.
arXiv Detail & Related papers (2020-01-14T22:26:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.