Depth Without the Magic: Inductive Bias of Natural Gradient Descent
- URL: http://arxiv.org/abs/2111.11542v1
- Date: Mon, 22 Nov 2021 21:20:10 GMT
- Title: Depth Without the Magic: Inductive Bias of Natural Gradient Descent
- Authors: Anna Kerekes, Anna Mészáros, Ferenc Huszár
- Abstract summary: In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories.
We characterize the behaviour of natural gradient flow in deep linear networks for separable classification under logistic loss and deep matrix factorization.
We demonstrate that there exist learning problems where natural gradient descent fails to generalize, while gradient descent with the right architecture performs well.
- Score: 1.020554144865699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In gradient descent, changing how we parametrize the model can lead to
drastically different optimization trajectories, giving rise to a surprising
range of meaningful inductive biases: identifying sparse classifiers or
reconstructing low-rank matrices without explicit regularization. This implicit
regularization has been hypothesised to be a contributing factor to good
generalization in deep learning. However, natural gradient descent is
approximately invariant to reparameterization: it always follows the same
trajectory and finds the same optimum. The question naturally arises: if we
eliminate the role of parameterization, which solution will be found, and
what new properties occur? We characterize the behaviour of natural
gradient flow in deep linear networks for separable classification under
logistic loss and deep matrix factorization. Some of our findings extend to
nonlinear neural networks with sufficient but finite over-parametrization. We
demonstrate that there exist learning problems where natural gradient descent
fails to generalize, while gradient descent with the right architecture
performs well.
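To make the invariance concrete: natural gradient descent preconditions the gradient with the inverse Fisher information matrix, updating theta <- theta - eta * F(theta)^{-1} grad L(theta), which makes the update (to first order) independent of how the model is parametrized. The sketch below is a minimal illustration in the spirit of the setting above, not the authors' code: it trains a depth-2 linear network with logistic loss on separable toy data and compares plain gradient descent with a natural-gradient-style update built from a damped empirical Fisher. The toy data, the damping, and all hyperparameters are assumptions of this illustration.

```python
# Minimal sketch (not the authors' code): plain vs. natural gradient descent on
# a depth-2 *linear* network trained with logistic loss on separable toy data.
# The damped empirical Fisher, the toy data, and all hyperparameters are
# assumptions of this illustration.
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data: the label is the sign of the first coordinate.
X = rng.normal(size=(64, 2))
y = (X[:, 0] > 0).astype(float)

def unpack(theta):
    """Split the flat parameter vector into (W1, w2), with f(x) = w2 @ (W1 @ x)."""
    return theta[:4].reshape(2, 2), theta[4:6]

def loss(theta):
    W1, w2 = unpack(theta)
    z = X @ W1.T @ w2                              # logits, shape (n,)
    return np.mean(np.logaddexp(0.0, z) - y * z)   # logistic loss

def per_example_grads(theta):
    W1, w2 = unpack(theta)
    z = X @ W1.T @ w2
    r = 1.0 / (1.0 + np.exp(-z)) - y               # dloss/dlogit per example
    dz_dW1 = np.einsum('j,nk->njk', w2, X).reshape(len(X), -1)  # vec(w2 x^T)
    dz_dw2 = X @ W1.T                              # W1 @ x for every example
    return r[:, None] * np.hstack([dz_dW1, dz_dw2])             # shape (n, 6)

def train(theta, natural, lr, steps=300, damping=1e-2):
    for _ in range(steps):
        G = per_example_grads(theta)
        g = G.mean(axis=0)
        if natural:
            # Damped empirical Fisher; the paper analyses exact natural gradient
            # flow, this is only a cheap finite-step stand-in.
            F = G.T @ G / len(X) + damping * np.eye(theta.size)
            theta = theta - lr * np.linalg.solve(F, g)
        else:
            theta = theta - lr * g
    return theta

theta0 = 0.1 * rng.normal(size=6)
for natural, lr in [(False, 0.5), (True, 0.02)]:
    theta = train(theta0.copy(), natural, lr)
    W1, w2 = unpack(theta)
    w = W1.T @ w2                                  # end-to-end linear predictor
    print(("natural" if natural else "plain  ") + " GD:",
          "loss =", round(float(loss(theta)), 4),
          "direction =", np.round(w / np.linalg.norm(w), 3))
```

The point of the comparison, echoing the abstract, is not the final loss but the direction of the end-to-end predictor that each update rule selects.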
Related papers
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed, sufficiently large step size is used (a toy illustration of the step-size threshold appears after this list).
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of minimum-trace-of-Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
arXiv Detail & Related papers (2023-01-29T20:54:03Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - More is Less: Inducing Sparsity via Overparameterization [2.885175627590247]
In deep learning it is common to overparameterize neural networks, that is, to use more parameters than training samples.
Quite surprisingly, training the neural network via (stochastic) gradient descent leads to models that generalize very well.
Our proof relies on analyzing a certain Bregman divergence of the flow.
arXiv Detail & Related papers (2021-12-21T07:55:55Z) - Implicit Regularization in ReLU Networks with the Square Loss [56.70360094597169]
We show that it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters.
Our results suggest that a more general framework may be needed to understand implicit regularization for nonlinear predictors.
arXiv Detail & Related papers (2020-12-09T16:48:03Z) - Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit
Bias towards Low Rank [1.9350867959464846]
In deep learning, gradient descent tends to prefer solutions which generalize well.
In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem.
arXiv Detail & Related papers (2020-11-27T15:08:34Z) - Coherent Gradients: An Approach to Understanding Generalization in
Gradient Descent-based Optimization [15.2292571922932]
We propose an approach to answering the question of why gradient-based optimization generalizes, based on a hypothesis about the dynamics of gradient descent.
We show that changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples.
arXiv Detail & Related papers (2020-02-25T03:59:31Z) - Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent [2.7793394375935088]
We prove that two-layer (Leaky)ReLU networks initialized by, e.g., the widely used method proposed by He et al. are not consistent.
arXiv Detail & Related papers (2020-02-12T09:22:45Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of an optimistic adaptive gradient algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
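The following toy computation illustrates the fixed-step-size threshold referenced in the first related paper above; it is a generic illustration on a one-dimensional quadratic, not code or analysis from that paper. On f(w) = w^2/2 the gradient is w, so gradient descent multiplies the iterate by (1 - step) each update and diverges once the fixed step size exceeds 2.

```python
# Toy illustration of the fixed-step-size threshold (not code from the paper):
# on f(w) = w**2 / 2 the gradient is w, so each gradient descent update
# multiplies the iterate by (1 - step); contraction requires |1 - step| < 1.
def gd_on_quadratic(step, iters=10, w=1.0):
    for _ in range(iters):
        w -= step * w                 # gradient of w**2 / 2 at w is w
    return w

print(gd_on_quadratic(0.5))   # 0.0009765625 -- contracts toward the minimum
print(gd_on_quadratic(2.5))   # about 57.7   -- |w| grows by 1.5x per step
```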
This list is automatically generated from the titles and abstracts of the papers in this site.