Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime
- URL: http://arxiv.org/abs/2504.18208v1
- Date: Fri, 25 Apr 2025 09:40:10 GMT
- Title: Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime
- Authors: Raphaël Barboni, Gabriel Peyré, François-Xavier Vialard
- Abstract summary: We study the convergence of gradient methods for the training of mean-field single hidden layer neural networks with square loss. We obtain guarantees for the convergence of the trained feature distribution towards the teacher feature distribution in a teacher-student setup.
- Score: 26.47265060394168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the convergence of gradient methods for the training of mean-field single hidden layer neural networks with square loss. Observing that this is a separable non-linear least-squares problem which is linear w.r.t. the outer layer's weights, we consider a Variable Projection (VarPro) or two-timescale learning algorithm, thereby eliminating the linear variables and reducing the learning problem to the training of the feature distribution. Whereas most convergence rates for the training of neural networks rely on a neural tangent kernel analysis where features are fixed, we show that such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamics of the feature distribution correspond to a weighted ultra-fast diffusion equation. Relying on recent results on the asymptotic behavior of such PDEs, we obtain guarantees for the convergence of the trained feature distribution towards the teacher feature distribution in a teacher-student setup.
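As a rough illustration of the two-timescale / VarPro idea above, the following sketch eliminates the outer (linear) weights in closed form at every step and updates only the features by gradient descent on the reduced objective; the tanh activation, ridge strength, step size, and finite-difference gradient are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def reduced_loss(W, X, y, reg=1e-3):
    """Reduced (VarPro) objective: for fixed features W, the outer weights of
    f(x) = sum_j a_j * sigma(w_j . x) under the square loss plus a ridge
    penalty are available in closed form and substituted back."""
    Phi = np.tanh(X @ W.T)                                   # (n, m) feature matrix
    a = np.linalg.solve(Phi.T @ Phi + reg * np.eye(W.shape[0]), Phi.T @ y)
    residual = Phi @ a - y
    return 0.5 * np.mean(residual ** 2) + 0.5 * reg * np.sum(a ** 2)

def feature_step(W, X, y, lr=1e-2, reg=1e-3, eps=1e-5):
    """One gradient step on the features only (the 'slow' variables),
    using a simple finite-difference gradient for clarity, not speed."""
    grad = np.zeros_like(W)
    base = reduced_loss(W, X, y, reg)
    for idx in np.ndindex(*W.shape):
        W_pert = W.copy()
        W_pert[idx] += eps
        grad[idx] = (reduced_loss(W_pert, X, y, reg) - base) / eps
    return W - lr * grad
```

Iterating `feature_step` in a teacher-student experiment and comparing the empirical distribution of the rows of `W` with the teacher's features would qualitatively mimic the convergence studied in the paper.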
Related papers
- An Analytic Solution to Covariance Propagation in Neural Networks [10.013553984400488]
This paper presents a sample-free moment propagation technique to accurately characterize the input-output distributions of neural networks.
A key enabler of our technique is an analytic solution for the covariance of random variables passed through nonlinear activation functions.
The wide applicability and merits of the proposed technique are shown in experiments analyzing the input-output distributions of trained neural networks and training Bayesian neural networks.
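As a concrete instance of such an analytic solution, the mean and variance of a ReLU applied to a Gaussian pre-activation admit closed forms; the snippet below uses the standard textbook formulas (via SciPy's normal CDF/PDF) and is only a generic illustration, not necessarily the paper's exact propagation rule.

```python
import numpy as np
from scipy.stats import norm

def relu_gaussian_moments(mu, sigma):
    """Closed-form mean and variance of ReLU(z) for z ~ N(mu, sigma^2):
    E[ReLU(z)]   = mu * Phi(mu/sigma) + sigma * phi(mu/sigma)
    E[ReLU(z)^2] = (mu^2 + sigma^2) * Phi(mu/sigma) + mu * sigma * phi(mu/sigma)
    where Phi and phi are the standard normal CDF and PDF."""
    alpha = mu / sigma
    mean = mu * norm.cdf(alpha) + sigma * norm.pdf(alpha)
    second_moment = (mu ** 2 + sigma ** 2) * norm.cdf(alpha) + mu * sigma * norm.pdf(alpha)
    return mean, second_moment - mean ** 2
```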
arXiv Detail & Related papers (2024-03-24T14:08:24Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
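For reference, a minimal sketch of the plain (centralized) AdaGrad update whose convergence rate is quoted above; the over-the-air federated aggregation analyzed in the paper is not modeled here.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-coordinate learning rates shrink with the
    running sum of squared gradients, which is what drives rates of the
    O(ln(T) / T^{1 - 1/alpha}) type."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum
```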
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Learning Theory of Distribution Regression with Neural Networks [6.961253535504979]
We establish an approximation theory and a learning theory of distribution regression via a fully connected neural network (FNN).
In contrast to the classical regression methods, the input variables of distribution regression are probability measures.
arXiv Detail & Related papers (2023-07-07T09:49:11Z) - Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
However, PINNs are prone to training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
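Below is a hedged sketch of one implicit gradient step, approximated by fixed-point iteration; the loss gradient, step size, and inner iteration count are placeholders rather than the paper's settings.

```python
import numpy as np

def implicit_gd_step(w, grad_fn, lr=0.1, n_inner=20):
    """Implicit gradient descent solves w_next = w - lr * grad_fn(w_next),
    i.e. the gradient is evaluated at the new iterate, which tends to be
    more stable on stiff objectives such as multi-scale PINN losses.
    The implicit equation is approximated here by fixed-point iteration."""
    w_next = np.array(w, dtype=float)
    for _ in range(n_inner):
        w_next = w - lr * grad_fn(w_next)
    return w_next
```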
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning.
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates [7.094295642076582]
The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime.
We establish a new linear convergence result for two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime.
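The noisy gradient descent in question is a Langevin-type dynamic on the neuron parameters; a minimal Euler discretization is sketched below, with the gradient, step size, and temperature as illustrative assumptions.

```python
import numpy as np

def noisy_gd_step(theta, grad_fn, lr=1e-2, temperature=1e-3, rng=None):
    """One Euler step of noisy (Langevin-type) gradient descent on the
    neuron parameters theta of a two-layer network: a gradient step plus
    Gaussian noise scaled by sqrt(2 * lr * temperature)."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.normal(size=theta.shape)
    return theta - lr * grad_fn(theta) + np.sqrt(2.0 * lr * temperature) * noise
```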
arXiv Detail & Related papers (2022-05-19T21:05:40Z) - On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z) - On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons [11.119895959906085]
Given a dense shallow neural network, we focus on creating, training, and combining randomly selected surrogate functions.
By analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error.
For fixed neuron selection probability, the error term decreases as we increase the number of surrogate models, and increases as we increase the number of local training steps.
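An illustrative sketch of the create/train/combine loop described above, where each surrogate is defined by a random neuron mask; the loss, the choice to train only the outer weights, and all hyperparameters are assumptions rather than the paper's exact procedure.

```python
import numpy as np

def train_with_masked_surrogates(W, a, X, y, keep_prob=0.5, n_surrogates=10,
                                 local_steps=5, lr=1e-2, rng=None):
    """Repeatedly sample a random neuron mask (keep probability keep_prob),
    run a few local gradient steps on the masked network's outer weights,
    and average the surrogates back into the dense model."""
    if rng is None:
        rng = np.random.default_rng(0)
    Phi = np.maximum(X @ W.T, 0.0)                 # (n, m) ReLU features
    a_combined = np.zeros_like(a)
    for _ in range(n_surrogates):
        mask = rng.random(a.shape) < keep_prob     # randomly selected neurons
        a_s = a.copy()
        for _ in range(local_steps):
            residual = (Phi * mask) @ a_s - y
            a_s = a_s - lr * (Phi * mask).T @ residual / len(y)
        a_combined += np.where(mask, a_s, a) / n_surrogates
    return a_combined
```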
arXiv Detail & Related papers (2021-12-05T19:51:14Z) - Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks [83.58049517083138]
We consider a two-layer ReLU network trained via gradient descent.
We show that SGD is biased towards a simple solution.
We also provide empirical evidence that knots at locations distinct from the data points might occur.
arXiv Detail & Related papers (2021-11-03T15:14:20Z) - Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
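For intuition, sampling a model from a learned one-dimensional subspace (a line) reduces to interpolating two endpoint weight vectors; the sketch below assumes those endpoints have already been trained jointly, which is the part the paper actually addresses.

```python
import numpy as np

def sample_from_line(theta_a, theta_b, rng=None):
    """Draw a network from the line segment between two endpoint weight
    vectors theta_a and theta_b (the learned subspace); every point on
    the segment is meant to be an accurate model."""
    if rng is None:
        rng = np.random.default_rng(0)
    t = rng.uniform()
    return (1.0 - t) * theta_a + t * theta_b
```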
arXiv Detail & Related papers (2021-02-20T23:26:58Z) - Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss [0.0]
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
arXiv Detail & Related papers (2020-02-11T15:42:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.