Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning
- URL: http://arxiv.org/abs/2302.01002v2
- Date: Tue, 18 Feb 2025 15:46:29 GMT
- Authors: Francois Caron, Fadhel Ayed, Paul Jung, Hoil Lee, Juho Lee, Hongseok Yang
- Abstract summary: We consider optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter.
We prove that for large such neural networks, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation.
- Abstract: We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that for large such neural networks, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
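As a concrete illustration of the setup described in the abstract, the sketch below trains a shallow ReLU network in which each hidden node's output is multiplied by its own scaling parameter. The toy data, the specific decaying scaling sequence, and the normalisation are illustrative choices made here, not taken from the paper; the NTK parameterisation would instead use the identical scaling 1/sqrt(m) for every node.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-D regression problem (illustrative data, not from the paper).
X = rng.normal(size=(20, 1))
y = np.sin(X[:, 0])

m = 200                                 # number of hidden nodes
# Asymmetrical node scalings: non-identical, here lambda_j ~ 1/j,
# normalised so that sum_j lambda_j^2 = 1 (one illustrative choice;
# the NTK parameterisation would use lambda_j = 1/sqrt(m) instead).
lam = 1.0 / np.arange(1, m + 1)
lam /= np.sqrt(np.sum(lam ** 2))

W = rng.normal(size=(m, 1))             # input weights
a = rng.normal(size=m)                  # output weights

def forward(X, W, a):
    h = np.maximum(X @ W.T, 0.0)        # ReLU activations, shape (n, m)
    return h @ (lam * a)                # node j's output scaled by lambda_j

def loss(X, y, W, a):
    return 0.5 * np.mean((forward(X, W, a) - y) ** 2)

lr = 0.2
losses = [loss(X, y, W, a)]
for _ in range(500):                    # plain gradient descent
    h = np.maximum(X @ W.T, 0.0)
    r = (h @ (lam * a) - y) / len(y)    # scaled residuals
    grad_a = lam * (h.T @ r)
    grad_W = (np.outer(r, lam * a) * (h > 0)).T @ X
    a -= lr * grad_a
    W -= lr * grad_W
    losses.append(loss(X, y, W, a))

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

On this toy problem the training loss decreases steadily, consistent with the paper's global-convergence claim, though this sketch is of course not evidence for the theorem itself.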
Related papers
- Stochastic Gradient Descent for Two-layer Neural Networks [2.0349026069285423]
This paper presents a study on the convergence rates of the stochastic gradient descent (SGD) algorithm when applied to overparameterized two-layer neural networks.
Our approach combines the Neural Tangent Kernel (NTK) approximation with convergence analysis in the Reproducing Kernel Hilbert Space (RKHS) generated by the NTK.
Our research framework enables us to explore the intricate interplay between kernel methods and optimization processes, shedding light on the dynamics and convergence properties of neural networks.
arXiv Detail & Related papers (2024-07-10T13:58:57Z) - Scalable Neural Network Kernels [22.299704296356836]
We introduce scalable neural network kernels (SNNKs), capable of approximating regular feedforward layers (FFLs).
We also introduce the neural network bundling process that applies SNNKs to compactify deep neural network architectures.
Our mechanism provides up to 5x reduction in the number of trainable parameters, while maintaining competitive accuracy.
arXiv Detail & Related papers (2023-10-20T02:12:56Z) - Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z) - On Feature Learning in Neural Networks with Global Convergence
Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature-map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature-map constructions achieving comparable error bounds, both in theory and in practice.
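For context on what such feature maps approximate: for a finite-width network, the parameter gradient of the output is itself an explicit (if high-dimensional) feature map whose inner products give the empirical NTK; the paper's contribution is constructing much lower-dimensional features. The sketch below computes this exact gradient feature map for a small ReLU network; the network sizes and parameterisation are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small fully-connected ReLU net f(x) = a^T relu(W x) / sqrt(m)
# under the NTK parameterisation (hypothetical sizes for illustration).
d, m, n = 3, 50, 8
W = rng.normal(size=(m, d))
a = rng.normal(size=m)
X = rng.normal(size=(n, d))

def grad_features(x, W, a):
    """Gradient of f(x) w.r.t. all parameters: an explicit feature map
    whose inner products give the empirical NTK."""
    h = W @ x
    act = np.maximum(h, 0.0)
    df_da = act / np.sqrt(m)                       # shape (m,)
    df_dW = np.outer(a * (h > 0), x) / np.sqrt(m)  # shape (m, d)
    return np.concatenate([df_da, df_dW.ravel()])

Phi = np.stack([grad_features(x, W, a) for x in X])  # (n, m + m*d)
K = Phi @ Phi.T                                      # empirical NTK Gram matrix

# A Gram matrix built from explicit features is symmetric and PSD.
print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) >= -1e-8)
```

The feature dimension here is m + m*d, which grows with width; the cited paper's point is that far smaller feature maps can approximate the (infinite-width) NTK with comparable error.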
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - DebiNet: Debiasing Linear Models with Nonlinear Overparameterized Neural
Networks [11.04121146441257]
We incorporate over-parameterized neural networks into semi-parametric models to bridge the gap between inference and prediction.
We show the theoretical foundations that make this possible and demonstrate with numerical experiments.
We propose a framework, DebiNet, in which arbitrary feature selection methods can be plugged into our semi-parametric neural network.
arXiv Detail & Related papers (2020-11-01T04:12:53Z) - Modeling from Features: a Mean-field Framework for Over-parameterized
Deep Neural Networks [54.27962244835622]
This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs).
In this framework, a DNN is represented by probability measures and functions over its features in the continuous limit.
We illustrate the framework via the standard DNN and the Residual Network (Res-Net) architectures.
arXiv Detail & Related papers (2020-07-03T01:37:16Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
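The averaging referred to here is Polyak–Ruppert iterate averaging: report the running mean of the SGD iterates rather than the last iterate. A minimal sketch on a toy noisy linear regression (a stand-in problem chosen here for illustration, not the paper's kernel-regression setting):

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy linear regression as a stand-in for the regression setting.
d, n = 5, 400
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_star + 0.5 * rng.normal(size=n)

theta = np.zeros(d)
theta_avg = np.zeros(d)
lr = 0.05
for t in range(n):                           # one sample per step
    i = rng.integers(n)
    g = (X[i] @ theta - y[i]) * X[i]         # stochastic gradient
    theta -= lr * g
    theta_avg += (theta - theta_avg) / (t + 1)   # running (Polyak) average

def mse(th):
    return np.mean((X @ th - y) ** 2)

print(f"last iterate MSE: {mse(theta):.3f}, averaged: {mse(theta_avg):.3f}")
```

Averaging smooths out the noise in the individual SGD steps; the cited paper analyses when this yields the minimax optimal rate in the NTK regime.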
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - On the infinite width limit of neural networks with a standard
parameterization [52.07828272324366]
We propose an improved extrapolation of the standard parameterization that preserves its key properties as width is taken to infinity.
We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization.
arXiv Detail & Related papers (2020-01-21T01:02:21Z)