Early Directional Convergence in Deep Homogeneous Neural Networks for
Small Initializations
- URL: http://arxiv.org/abs/2403.08121v1
- Date: Tue, 12 Mar 2024 23:17:32 GMT
- Title: Early Directional Convergence in Deep Homogeneous Neural Networks for
Small Initializations
- Authors: Akshay Kumar and Jarvis Haupt
- Abstract summary: This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks.
The weights of the neural network remain small in norm and approximately converge in direction along the Karush-Kuhn-Tucker (KKT) points of the neural correlation function.
- Score: 2.310288676109785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies the gradient flow dynamics that arise when training deep
homogeneous neural networks, starting with small initializations. The present
work considers neural networks that are assumed to have locally Lipschitz
gradients and an order of homogeneity strictly greater than two. This paper
demonstrates that for sufficiently small initializations, during the early
stages of training, the weights of the neural network remain small in norm and
approximately converge in direction along the Karush-Kuhn-Tucker (KKT) points
of the neural correlation function introduced in [1]. Additionally, for square
loss and under a separability assumption on the weights of neural networks, a
similar directional convergence of gradient flow dynamics is shown near certain
saddle points of the loss function.
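For readers who prefer symbols, here is a rough formalization of the claim above in our own notation (stated loosely; see the paper and [1] for the precise assumptions and for the exact definition of the neural correlation function):

    % Network output H(x; w) is L-homogeneous in the weights w, with L > 2
    % and a locally Lipschitz gradient:
    H(x; c\,w) = c^{L} H(x; w) \quad \text{for all } c > 0.

    % Gradient flow on the training loss, started from a small initialization:
    \dot{w}(t) = -\nabla \mathcal{L}\bigl(w(t)\bigr), \qquad
    w(0) = \delta\, w_0, \quad \|w_0\| = 1, \quad 0 < \delta \ll 1.

    % Early-phase claim (paraphrased): for sufficiently small delta, over an
    % initial time window \|w(t)\| remains small while the direction
    % approximately reaches a KKT point w^* of the sphere-constrained
    % optimization of the neural correlation function \mathcal{N} of [1],
    % i.e. first-order stationarity on the unit sphere:
    \frac{w(t)}{\|w(t)\|} \approx w^{*}, \qquad
    \nabla \mathcal{N}(w^{*}) = \lambda\, w^{*}, \quad \|w^{*}\| = 1 .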
Related papers
- Towards Understanding Gradient Flow Dynamics of Homogeneous Neural Networks Beyond the Origin [1.9556053645976448]
Recent works have established that in the early stages of training, the weights remain small and near the origin, but converge in direction.
This paper studies the gradient flow dynamics of homogeneous neural networks with locally Lipschitz gradients, after they escape the origin.
arXiv Detail & Related papers (2025-02-21T21:32:31Z) - Novel Kernel Models and Exact Representor Theory for Neural Networks Beyond the Over-Parameterized Regime [52.00917519626559]
This paper presents two models of neural-networks and their training applicable to neural networks of arbitrary width, depth and topology.
We also present a novel, exact representor theory for layer-wise neural network training with unregularized gradient descent in terms of a local-extrinsic neural kernel (LeNK).
This representor theory gives insight into the role of higher-order statistics in neural network training and the effect of kernel evolution in neural-network kernel models.
arXiv Detail & Related papers (2024-05-24T06:30:36Z) - Directional Convergence Near Small Initializations and Saddles in Two-Homogeneous Neural Networks [1.9556053645976448]
This paper examines gradient flow dynamics of two-homogeneous neural networks for small initializations.
For square loss, neural networks undergo saddle-to-saddle dynamics when close to the origin.
Motivated by this, this paper also shows a similar directional convergence among weights of small magnitude in the neighborhood of certain saddle points.
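As an informal sanity check of the alignment phenomenon these abstracts describe, the toy simulation below (entirely our own construction, not taken from either paper) trains a bias-free two-layer ReLU network, which is 2-homogeneous in its weights, from a very small initialization, using small-step gradient descent as a crude proxy for gradient flow. The architecture, data, and step size are arbitrary choices; the point is only that the weight norm stays tiny over this window while the weight direction stabilizes early.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, h = 20, 5, 10                       # samples, input dim, hidden width
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    delta = 1e-8                              # small-initialization scale
    W = delta * rng.standard_normal((h, d))   # first-layer weights
    v = delta * rng.standard_normal(h)        # second-layer weights

    def grads(W, v):
        """Gradients of the square loss for the bias-free net x -> v . relu(W x)."""
        pre = X @ W.T                         # (n, h) pre-activations
        act = np.maximum(pre, 0.0)
        r = act @ v - y                       # residuals
        gv = act.T @ r / n
        gW = ((r[:, None] * (pre > 0)) * v[None, :]).T @ X / n
        return gW, gv

    def direction(W, v):
        theta = np.concatenate([W.ravel(), v])
        return theta / (np.linalg.norm(theta) + 1e-30)

    eta, steps = 1e-2, 2000                   # small steps approximate gradient flow
    prev = direction(W, v)
    for t in range(1, steps + 1):
        gW, gv = grads(W, v)
        W -= eta * gW
        v -= eta * gv
        if t % 400 == 0:
            cur = direction(W, v)
            norm = np.linalg.norm(np.concatenate([W.ravel(), v]))
            # Expect the norm to stay far below 1 over this window while the cosine
            # between consecutive checkpoint directions climbs toward 1 (exact
            # numbers vary with the data and seed; this is only an illustration).
            print(f"step {t:5d}  ||theta|| = {norm:.2e}  cos(prev, cur) = {prev @ cur:+.4f}")
            prev = cur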
arXiv Detail & Related papers (2024-02-14T15:10:37Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - How many Neurons do we need? A refined Analysis for Shallow Networks
trained with Gradient Descent [0.0]
We analyze the generalization properties of two-layer neural networks in the neural tangent kernel regime.
We derive fast rates of convergence that are known to be minimax optimal in the framework of non-parametric regression.
arXiv Detail & Related papers (2023-09-14T22:10:28Z) - Stabilizing RNN Gradients through Pre-training [3.335932527835653]
Theories of learning propose preventing the gradient from growing exponentially with depth or time, in order to stabilize and improve training.
We extend known stability theories to encompass a broader family of deep recurrent networks, requiring minimal assumptions on data and parameter distribution.
We propose a new approach to mitigate this issue, which consists of giving a weight of one half to the time and depth contributions to the gradient.
arXiv Detail & Related papers (2023-08-23T11:48:35Z) - Understanding the Initial Condensation of Convolutional Neural Networks [6.451914896767135]
The kernels of two-layer convolutional neural networks converge to one or a few directions during training.
This work represents a step towards a better understanding of the non-linear training behavior exhibited by neural networks with specialized structures.
arXiv Detail & Related papers (2023-05-17T05:00:47Z) - Phase Diagram of Initial Condensation for Two-layer Neural Networks [4.404198015660192]
We present a phase diagram of initial condensation for two-layer neural networks.
Our phase diagram serves to provide a comprehensive understanding of the dynamical regimes of neural networks.
arXiv Detail & Related papers (2023-03-12T03:55:38Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Gradient flow dynamics of shallow ReLU networks for square loss and
orthogonal inputs [19.401271427657395]
The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution.
This article presents the gradient flow dynamics of one-hidden-layer ReLU networks trained on the mean squared error at small initialisation.
arXiv Detail & Related papers (2022-06-02T09:01:25Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that the averaged gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
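For context, "noisy gradient descent with weight decay" in this summary refers to a perturbed, regularized update; the sketch below shows one generic form of that update for a two-layer network. It is only an illustration of the update rule named above; the activation, noise model, and hyper-parameters are our own choices, not those analyzed in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, m = 50, 10, 64                      # samples, input dim, hidden width
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    W = rng.standard_normal((m, d)) / np.sqrt(d)   # first layer
    a = rng.standard_normal(m) / np.sqrt(m)        # second layer

    eta, lam, sigma = 1e-2, 1e-3, 1e-2        # step size, weight decay, noise level

    def loss_and_grads(W, a):
        pre = X @ W.T                          # (n, m) pre-activations
        act = np.maximum(pre, 0.0)             # ReLU features (activation choice is ours)
        r = act @ a - y                        # residuals
        loss = 0.5 * np.mean(r ** 2)
        ga = act.T @ r / n
        gW = ((r[:, None] * (pre > 0)) * a[None, :]).T @ X / n
        return loss, gW, ga

    for t in range(2001):
        loss, gW, ga = loss_and_grads(W, a)
        # Noisy gradient descent with weight decay: regularized gradient step
        # plus Gaussian noise (Langevin-style sqrt(eta) scaling, our convention).
        W += -eta * (gW + lam * W) + sigma * np.sqrt(eta) * rng.standard_normal(W.shape)
        a += -eta * (ga + lam * a) + sigma * np.sqrt(eta) * rng.standard_normal(a.shape)
        if t % 500 == 0:
            print(f"step {t:4d}  loss = {loss:.4f}")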
arXiv Detail & Related papers (2020-02-10T18:56:15Z) - On the Convex Behavior of Deep Neural Networks in Relation to the
Layers' Width [99.24399270311069]
We observe that for wider networks, minimizing the loss with gradient descent maneuvers through surfaces of positive curvature at the start and end of training, and of close-to-zero curvature in between.
In other words, it seems that during crucial parts of the training process, the Hessian in wide networks is dominated by the component G.
arXiv Detail & Related papers (2020-01-14T16:30:01Z)