Learning Neural Networks by Neuron Pursuit
- URL: http://arxiv.org/abs/2509.12154v1
- Date: Mon, 15 Sep 2025 17:18:35 GMT
- Title: Learning Neural Networks by Neuron Pursuit
- Authors: Akshay Kumar, Jarvis Haupt
- Abstract summary: This paper studies the evolution of gradient flow for homogeneous neural networks near a class of saddle points exhibiting a sparsity structure. The choice of these saddle points is motivated by previous works on homogeneous networks, which identified the first saddle point encountered by gradient flow after escaping the origin. The second part of the paper introduces a greedy algorithm to train deep neural networks called Neuron Pursuit (NP).
- Score: 0.9975341265604576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The first part of this paper studies the evolution of gradient flow for homogeneous neural networks near a class of saddle points exhibiting a sparsity structure. The choice of these saddle points is motivated by previous works on homogeneous networks, which identified the first saddle point encountered by gradient flow after escaping the origin. It is shown here that, when initialized sufficiently close to such saddle points, gradient flow remains near the saddle point for a sufficiently long time, during which the weights with small norm remain small but converge in direction. Furthermore, important empirical observations are made on the behavior of gradient descent after escaping these saddle points. The second part of the paper, motivated by these results, introduces a greedy algorithm to train deep neural networks called Neuron Pursuit (NP). It is an iterative procedure that alternates between expanding the network by adding neuron(s) with carefully chosen weights, and minimizing the training loss using this augmented network. The efficacy of the proposed algorithm is validated using numerical experiments.
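The abstract describes Neuron Pursuit only at a high level: repeatedly grow the network by adding a neuron (or group of neurons) with carefully chosen weights, then minimize the training loss with the augmented network. The sketch below is a minimal NumPy illustration of such a greedy loop for a two-layer ReLU network with squared loss; the residual-correlation rule used to pick each new neuron's direction and the plain gradient-descent refinement are placeholder assumptions for illustration, not the paper's construction, which motivates the weight choice from its saddle-point analysis.
```python
# Minimal sketch of a Neuron Pursuit-style greedy trainer.
# NOTE: the expansion rule below (random candidate directions scored against the
# residual) and the refinement step are assumptions, not the paper's procedure.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def predict(X, W, a):
    # Two-layer ReLU network: f(x) = sum_k a_k * relu(w_k . x)
    return relu(X @ W.T) @ a

def minimize_loss(X, y, W, a, lr=1e-2, steps=2000):
    # Minimize squared loss over all current weights by plain gradient descent.
    for _ in range(steps):
        H = relu(X @ W.T)                              # hidden activations, (n, k)
        r = H @ a - y                                  # residual, (n,)
        grad_a = H.T @ r / len(y)
        grad_W = ((r[:, None] * a) * (H > 0)).T @ X / len(y)
        a -= lr * grad_a
        W -= lr * grad_W
    return W, a

def neuron_pursuit(X, y, num_neurons=5, candidates=200, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = np.zeros((0, d))
    a = np.zeros(0)
    for _ in range(num_neurons):
        # Step 1 (expansion): choose a new neuron direction. Heuristic used here:
        # sample random unit directions and keep the one whose activation pattern
        # correlates most with the current residual (a matching-pursuit-like rule).
        residual = y - predict(X, W, a) if len(a) else y.copy()
        dirs = rng.standard_normal((candidates, d))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        scores = np.abs(relu(X @ dirs.T).T @ residual)
        W = np.vstack([W, dirs[np.argmax(scores)]])
        a = np.append(a, 1e-3)                         # new output weight starts small
        # Step 2 (refinement): minimize the training loss with the augmented network.
        W, a = minimize_loss(X, y, W, a)
    return W, a
```
In this sketch the new output weight starts near zero, so the added neuron initially perturbs the network only slightly; this loosely mirrors the small-norm weights analyzed in the first part of the paper, but the exact initialization used by NP is not specified in the abstract.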
Related papers
- Towards Understanding Gradient Flow Dynamics of Homogeneous Neural Networks Beyond the Origin [1.9556053645976448]
Recent works have established that in the early stages of training, the weights remain small and near the origin, but converge in direction. This paper studies the gradient flow dynamics of homogeneous neural networks with locally Lipschitz gradients after they escape the origin.
arXiv Detail & Related papers (2025-02-21T21:32:31Z) - Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations [1.9556053645976448]
This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks assumed to have locally Lipschitz gradients and an order of homogeneity strictly greater than two.
arXiv Detail & Related papers (2024-03-12T23:17:32Z) - Directional Convergence Near Small Initializations and Saddles in Two-Homogeneous Neural Networks [1.9556053645976448]
This paper examines gradient flow dynamics of two-homogeneous neural networks for small initializations.
For square loss, neural networks undergo saddle-to-saddle dynamics when initialized close to the origin.
Motivated by this, this paper also shows a similar directional convergence among weights of small magnitude in the neighborhood of certain saddle points.
arXiv Detail & Related papers (2024-02-14T15:10:37Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - On the Neural Tangent Kernel Analysis of Randomly Pruned Neural Networks [91.3755431537592]
We study how random pruning of the weights affects a neural network's neural tangent kernel (NTK).
In particular, this work establishes an equivalence of the NTKs between a fully-connected neural network and its randomly pruned version.
arXiv Detail & Related papers (2022-03-27T15:22:19Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z) - Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep
Network Losses [2.046307988932347]
Gradient-based algorithms converge to approximately the same performance from random initial points.
We show that the methods used to find putative critical points suffer from a bad minima problem of their own.
arXiv Detail & Related papers (2020-03-23T17:16:19Z) - MSE-Optimal Neural Network Initialization via Layer Fusion [68.72356718879428]
Deep neural networks achieve state-of-the-art performance for a range of classification and inference tasks.
The use of gradient descent combined with nonconvexity renders parameter learning susceptible to initialization effects.
We propose fusing neighboring layers of deeper networks that are trained with random initializations.
arXiv Detail & Related papers (2020-01-28T18:25:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.