An Effective Gram Matrix Characterizes Generalization in Deep Networks
- URL: http://arxiv.org/abs/2504.16450v2
- Date: Thu, 24 Apr 2025 02:31:15 GMT
- Title: An Effective Gram Matrix Characterizes Generalization in Deep Networks
- Authors: Rubing Yang, Pratik Chaudhari
- Abstract summary: We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. We analyze this differential equation to compute an ``effective Gram matrix'' that characterizes the generalization gap after training.
- Score: 22.314071077213935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities: a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for their training on different datasets. We analyze this differential equation to compute an ``effective Gram matrix'' that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain initial ``residual''. Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual differs across datasets and architectures. The match or mismatch between the data and the architecture is primarily responsible for good or bad generalization.
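As a quick illustration of the alignment property described above (the residual concentrating in the subspace of the effective Gram matrix with the smallest eigenvalues), the sketch below computes how much of a residual vector's energy falls in the bottom-k eigenspace of a Gram matrix. This is a minimal sketch, not the paper's construction: the matrix G and the residual r are random placeholders assumed here for illustration; in the paper both quantities are derived from the training dynamics.

```python
import numpy as np

# Minimal sketch of the alignment diagnostic suggested by the abstract.
# G stands in for the effective Gram matrix and r for the initial residual;
# both are random placeholders here, not the paper's actual quantities.

rng = np.random.default_rng(0)
n = 200

A = rng.standard_normal((n, n))
G = A @ A.T / n                      # placeholder positive semi-definite matrix
r = rng.standard_normal(n)           # placeholder residual vector

# eigh returns eigenvalues in ascending order, so the first columns of
# eigvecs span the smallest-eigenvalue subspace.
eigvals, eigvecs = np.linalg.eigh(G)

# Fraction of the residual's energy captured by the k smallest-eigenvalue
# eigenvectors, as a function of k.
coeffs = eigvecs.T @ r
energy_fraction = np.cumsum(coeffs**2) / np.sum(coeffs**2)

for k in (10, 50, 100):
    print(f"residual energy in bottom-{k} eigenspace: {energy_fraction[k - 1]:.3f}")
```

For the random placeholders used here the fraction grows roughly like k/n; the paper's claim is that, for the actual effective Gram matrix and residual, it is already large for small k, which is what keeps the generalization gap from deteriorating during training.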
Related papers
- Landscape Complexity for the Empirical Risk of Generalized Linear Models: Discrimination between Structured Data [2.486161976966064]
We use the Kac-Rice formula and results from random matrix theory to obtain the average number of critical points of a family of high-dimensional empirical loss functions. The correlations are introduced to model the existence of structure in the data, as is common in current Machine-Learning systems. For completeness, we also treat the case of a loss function used in training Generalized Linear Models in the presence of correlated input data.
arXiv Detail & Related papers (2025-03-18T16:44:33Z)
- Generalization for Least Squares Regression With Simple Spiked Covariances [3.9134031118910264]
The generalization properties of even two-layer neural networks trained by gradient descent remain poorly understood.
Recent work has made progress by describing the spectrum of the feature matrix at the hidden layer.
Yet, the generalization error for linear models with spiked covariances has not been previously determined.
arXiv Detail & Related papers (2024-10-17T19:46:51Z)
- Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
arXiv Detail & Related papers (2024-10-13T13:39:39Z)
- Implicit Regularization of Gradient Flow on One-Layer Softmax Attention [10.060496091806694]
We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model.
Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices.
arXiv Detail & Related papers (2024-03-13T17:02:27Z)
- Learning High-Dimensional Differential Graphs From Multi-Attribute Data [12.94486861344922]
We consider the problem of estimating differences in two Gaussian graphical models (GGMs) which are known to have similar structure.
Existing methods for differential graph estimation are based on single-attribute (SA) models.
In this paper, we analyze a group lasso penalized D-trace loss function approach for differential graph learning from multi-attribute data.
arXiv Detail & Related papers (2023-12-05T18:54:46Z)
- Does the Data Induce Capacity Control in Deep Learning? [0.0]
This paper studies how the dataset may be the cause of the anomalous generalization performance of deep networks.
We show that the data correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range.
arXiv Detail & Related papers (2021-10-27T04:40:27Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z)
- Multilayer Clustered Graph Learning [66.94201299553336]
We use contrastive loss as a data fidelity term, in order to properly aggregate the observed layers into a representative graph.
Experiments show that our method produces representative graphs with a clear cluster structure and performs well on clustering problems.
arXiv Detail & Related papers (2020-10-29T09:58:02Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
- Eigendecomposition-Free Training of Deep Networks for Linear Least-Square Problems [107.3868459697569]
We introduce an eigendecomposition-free approach to training a deep network.
We show that our approach is much more robust than explicit differentiation of the eigendecomposition.
Our method has better convergence properties and yields state-of-the-art results.
arXiv Detail & Related papers (2020-04-15T04:29:34Z)