NGD converges to less degenerate solutions than SGD
- URL: http://arxiv.org/abs/2409.04913v2
- Date: Thu, 12 Sep 2024 21:04:20 GMT
- Title: NGD converges to less degenerate solutions than SGD
- Authors: Moosa Saghir, N. R. Raghavendra, Zihe Liu, Evan Ryan Gunter
- Abstract summary: The number of free parameters, or dimension, of a model is a straightforward way to measure its complexity.
But this is not an accurate measure of complexity: models capable of memorizing their training data often generalize well despite their high dimension.
Effective dimension aims to more directly capture the complexity of a model by counting only the number of parameters required to represent the functionality of the model.
- Score: 0.5249805590164902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The number of free parameters, or dimension, of a model is a straightforward way to measure its complexity: a model with more parameters can encode more information. However, this is not an accurate measure of complexity: models capable of memorizing their training data often generalize well despite their high dimension. Effective dimension aims to more directly capture the complexity of a model by counting only the number of parameters required to represent the functionality of the model. Singular learning theory (SLT) proposes the learning coefficient $ \lambda $ as a more accurate measure of effective dimension. By describing the rate of increase of the volume of the region of parameter space around a local minimum with respect to loss, $ \lambda $ incorporates information from higher-order terms. We compare $ \lambda $ of models trained using natural gradient descent (NGD) and stochastic gradient descent (SGD), and find that those trained with NGD consistently have a higher effective dimension for both of our methods: the Hessian trace $ \text{Tr}(\mathbf{H}) $, and the estimate of the local learning coefficient (LLC) $ \hat{\lambda}(w^*) $.
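As a minimal illustrative sketch (not the authors' code), the snippet below estimates the Hessian trace $\text{Tr}(\mathbf{H})$ of a toy model's loss with Hutchinson's estimator and Hessian-vector products from PyTorch autograd; the model, data, and number of probe vectors are placeholder choices. The LLC estimate $\hat{\lambda}(w^*)$ would additionally require posterior sampling (e.g. SGLD) around $w^*$ and is not shown.

```python
# Illustrative sketch: Hutchinson estimate of Tr(H) for a toy model's loss.
# The architecture, data, and probe count are placeholders, not the paper's setup.
import torch
import torch.nn as nn

def hutchinson_hessian_trace(model, loss_fn, x, y, n_probes=100):
    """Estimate Tr(H), the trace of the Hessian of the loss w.r.t. the parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace_estimate = 0.0
    for _ in range(n_probes):
        # Rademacher probe vectors v with entries in {-1, +1}
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product via double backprop: Hv = d(g . v)/dw
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        trace_estimate += sum((v * hv).sum() for v, hv in zip(vs, hvs)).item()
    return trace_estimate / n_probes

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    x, y = torch.randn(256, 10), torch.randn(256, 1)
    print(f"Estimated Tr(H): {hutchinson_hessian_trace(model, nn.MSELoss(), x, y):.4f}")
```

Hutchinson's estimator is unbiased because $\mathbb{E}[v^\top \mathbf{H} v] = \text{Tr}(\mathbf{H})$ for Rademacher probes $v$; the variance of the estimate decreases with the number of probes.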
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite-dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size and $N$ the data size.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
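Purely as an illustration of how such a power-law bound trades off model size against data, and assuming the reconstructed form above with $M$ the model size and $N$ the data size (assumptions here, not taken from the paper), the snippet below evaluates $M^{-(a-1)} + N^{-(a-1)/a}$ for a few placeholder values.

```python
# Hypothetical illustration only: evaluates a power-law test-error bound of the
# assumed form M**-(a-1) + N**(-(a-1)/a). The exponent a, model size M, and
# sample size N below are made-up values, not results from the paper.
def reducible_error_bound(M: int, N: int, a: float) -> float:
    return M ** (-(a - 1)) + N ** (-(a - 1) / a)

if __name__ == "__main__":
    a = 1.5  # placeholder data power-law exponent
    for M, N in [(10**3, 10**5), (10**4, 10**5), (10**4, 10**6)]:
        print(f"M={M:>6}, N={N:>7}: bound ~ {reducible_error_bound(M, N, a):.3e}")
```

Growing either $M$ or $N$ alone eventually saturates the bound at the other term, which is the compute/parameters/data trade-off such scaling laws describe.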
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
- Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z)
- Gaussian process regression and conditional Karhunen-Loève models for data assimilation in inverse problems [68.8204255655161]
We present a model inversion algorithm, CKLEMAP, for data assimilation and parameter estimation in partial differential equation models.
The CKLEMAP method provides better scalability compared to the standard MAP method.
arXiv Detail & Related papers (2023-01-26T18:14:12Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Faster Convergence of Local SGD for Over-Parameterized Models [1.5504102675587357]
Modern machine learning architectures are often highly expressive.
We analyze the convergence of Local SGD (or FedAvg) for such over-parameterized functions in a heterogeneous data setting.
For general convex loss functions, we establish an error bound of $O(1/T)$ under a data similarity assumption and $O(K/T)$ otherwise, where $K$ is the number of local steps and $T$ the total number of iterations.
For non-convex loss functions, we prove an error bound of $O(K/T)$ in both cases.
We complete our results by providing problem instances in which our established convergence rates are tight to a constant factor with a reasonably small stepsize.
arXiv Detail & Related papers (2022-01-30T04:05:56Z)
- Exponential Family Model-Based Reinforcement Learning via Score Matching [97.31477125728844]
We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL).
SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression.
arXiv Detail & Related papers (2021-12-28T15:51:07Z)
- Inverting brain grey matter models with likelihood-free inference: a tool for trustable cytoarchitecture measurements [62.997667081978825]
Characterisation of the brain grey matter cytoarchitecture with quantitative sensitivity to soma density and volume remains an unsolved challenge in dMRI.
We propose a new forward model, specifically a new system of equations, requiring a few relatively sparse b-shells.
We then apply modern tools from Bayesian analysis known as likelihood-free inference (LFI) to invert our proposed model.
arXiv Detail & Related papers (2021-11-15T09:08:27Z)
- SGD Through the Lens of Kolmogorov Complexity [0.15229257192293197]
Stochastic gradient descent (SGD) finds a solution that achieves $(1-\epsilon)$ classification accuracy on the entire dataset.
This is the first result which is completely model agnostic: we don't require the model to have any specific architecture or activation function.
arXiv Detail & Related papers (2021-11-10T01:32:38Z)
- Revisiting minimum description length complexity in overparameterized models [38.21167656112762]
We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods.
For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases.
We also prove that MDL-COMP bounds the in-sample mean squared error (MSE).
arXiv Detail & Related papers (2020-06-17T22:45:14Z)
- Learning the Stein Discrepancy for Training and Evaluating Energy-Based Models without Sampling [30.406623987492726]
We present a new method for evaluating and training unnormalized density models.
We estimate the Stein discrepancy between the data density $p(x)$ and the model density $q(x)$ defined by a vector function of the data.
This yields a novel goodness-of-fit test which outperforms existing methods on high dimensional data.
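As a rough sketch of the quantity being estimated (not the paper's learned-critic training procedure), the snippet below evaluates the Stein discrepancy objective $\mathbb{E}_{x \sim p}[\nabla_x \log q(x)^\top f(x) + \mathrm{tr}(\nabla_x f(x))]$ for a fixed critic $f$, with the trace term approximated by Hutchinson probes; the Gaussian model score and the randomly initialized critic are placeholder assumptions.

```python
# Illustrative sketch: Monte Carlo estimate of E_p[ s_q(x)^T f(x) + tr(df/dx) ]
# for a fixed critic f. The Gaussian score and random critic are placeholders.
import torch
import torch.nn as nn

def stein_discrepancy(x, score_q, critic, n_probes=4):
    x = x.clone().requires_grad_(True)
    fx = critic(x)                                 # critic output f(x), shape (batch, dim)
    term1 = (score_q(x) * fx).sum(dim=1)           # s_q(x)^T f(x)
    term2 = torch.zeros_like(term1)
    for _ in range(n_probes):
        # Rademacher probe; v^T (df/dx) v is an unbiased estimate of tr(df/dx)
        v = torch.randint_like(x, high=2) * 2.0 - 1.0
        vjp = torch.autograd.grad(fx, x, grad_outputs=v, retain_graph=True)[0]
        term2 = term2 + (vjp * v).sum(dim=1)
    return (term1 + term2 / n_probes).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 2
    critic = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
    x_p = torch.randn(512, dim) + 1.0              # samples from a shifted Gaussian "data" density p
    score_q = lambda z: -z                         # score of the model density q = N(0, I)
    print(f"Estimated Stein discrepancy: {stein_discrepancy(x_p, score_q, critic).item():.4f}")
```

In the paper's setting the critic is trained to maximize this objective (with regularization); here it is only evaluated once for illustration.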
arXiv Detail & Related papers (2020-02-13T16:39:07Z)
- Dual Stochastic Natural Gradient Descent and convergence of interior half-space gradient approximations [0.0]
Multinomial logistic regression (MLR) is widely used in statistics and machine learning.
Stochastic gradient descent (SGD) is the most common approach for determining the parameters of an MLR model in big data scenarios.
arXiv Detail & Related papers (2020-01-19T00:53:49Z)