Related papers: Bayes-optimal learning of an extensive-width neural network from quadratically many samples

URL: http://arxiv.org/abs/2408.03733v1
Date: Wed, 7 Aug 2024 12:41:56 GMT
Title: Bayes-optimal learning of an extensive-width neural network from quadratically many samples
Authors: Antoine Maillard, Emanuele Troiani, Simon Martin, Florent Krzakala, Lenka Zdeborová
Abstract summary: We consider the problem of learning a target function corresponding to a single hidden layer neural network. We consider the limit where the input dimension and the network width are proportionally large.
Score: 28.315491743569897
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We consider the problem of learning a target function corresponding to a single hidden layer neural network, with a quadratic activation function after the first layer, and random weights. We consider the asymptotic limit where the input dimension and the network width are proportionally large. Recent work [Cui & al '23] established that linear regression provides Bayes-optimal test error to learn such a function when the number of available samples is only linear in the dimension. That work stressed the open challenge of theoretically analyzing the optimal test error in the more interesting regime where the number of samples is quadratic in the dimension. In this paper, we solve this challenge for quadratic activations and derive a closed-form expression for the Bayes-optimal test error. We also provide an algorithm, that we call GAMP-RIE, which combines approximate message passing with rotationally invariant matrix denoising, and that asymptotically achieves the optimal performance. Technically, our result is enabled by establishing a link with recent works on optimal denoising of extensive-rank matrices and on the ellipsoid fitting problem. We further show empirically that, in the absence of noise, randomly-initialized gradient descent seems to sample the space of weights, leading to zero training loss, and averaging over initialization leads to a test error equal to the Bayes-optimal one.
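
The target function described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the 1/sqrt(d) pre-activation scaling, the uniform 1/m second layer, the noiseless labels, and the names `quadratic_teacher` and `quadratic_form` are all assumptions made here. It also shows why the problem connects to matrix denoising: with a quadratic activation, the output depends on the weights only through the d x d matrix S = W^T W / m.

```python
import numpy as np

def quadratic_teacher(X, W):
    """Single-hidden-layer network with quadratic activation.

    Assumed form: y_i = (1/m) * sum_k ((w_k . x_i) / sqrt(d))**2
    X has shape (n, d); W has shape (m, d).
    """
    d = X.shape[1]
    pre = X @ W.T / np.sqrt(d)          # pre-activations, shape (n, m)
    return np.mean(pre ** 2, axis=1)    # average of squared units, shape (n,)

def quadratic_form(X, W):
    """Same function written as x^T S x / d with S = W^T W / m."""
    m, d = W.shape
    S = W.T @ W / m                     # d x d matrix of rank at most m
    return np.einsum("ni,ij,nj->n", X, S, X) / d

rng = np.random.default_rng(0)
d = 30          # input dimension
m = 15          # width proportional to d (ratio 0.5 chosen arbitrarily)
n = d * d       # number of samples quadratic in d
W = rng.standard_normal((m, d))         # random teacher weights
X = rng.standard_normal((n, d))         # Gaussian inputs (assumed)
y = quadratic_teacher(X, W)
```

Because both functions compute the same labels, learning the network from (X, y) amounts to estimating the matrix S from n quadratic measurements, which is the sense in which the abstract links the problem to denoising of extensive-rank matrices.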

Related papers

Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems. Such problems are encountered in medicine, physics, and machine learning. We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation. Our results improve upon the shortcomings of the well-established Rademacher complexity-based bounds. We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z)
A Sample Efficient Alternating Minimization-based Algorithm For Robust Phase Retrieval [56.67706781191521]
In this work, we study a robust phase retrieval problem where the task is to recover an unknown signal in the presence of outliers. Our proposed oracle avoids the need for computationally intensive spectral methods, using a simple gradient step instead.
arXiv Detail & Related papers (2024-09-07T06:37:23Z)
A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparameterized two-layer neural networks. We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations. We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
Bayes-optimal Learning of Deep Random Networks of Extensive-width [22.640648403570957]
We consider the problem of learning a target function corresponding to a deep, extensive, non-linear neural network with random Gaussian weights. We compute closed-form expressions for the test errors of ridge regression, kernel and random features regression. We show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
arXiv Detail & Related papers (2023-02-01T11:14:08Z)
A Neural Network Warm-Start Approach for the Inverse Acoustic Obstacle Scattering Problem [7.624866197576227]
We present a neural network warm-start approach for solving the inverse scattering problem. An initial guess for the optimization problem is obtained using a trained neural network. The algorithm remains robust to noise in the scattered field measurements and also converges to the true solution for limited aperture data.
arXiv Detail & Related papers (2022-12-16T22:18:48Z)
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
Algorithms for Efficiently Learning Low-Rank Neural Networks [12.916132936159713]
We study algorithms for learning low-rank neural networks. We present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network. We propose a novel low-rank framework for training low-rank deep networks.
arXiv Detail & Related papers (2022-02-02T01:08:29Z)
Error-Correcting Neural Networks for Two-Dimensional Curvature Computation in the Level-Set Method [0.0]
We present an error-neural-modeling-based strategy for approximating two-dimensional curvature in the level-set method. Our main contribution is a redesigned hybrid solver that relies on numerical schemes to enable machine-learning operations on demand.
arXiv Detail & Related papers (2022-01-22T05:14:40Z)
Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume maximization on turnstile streams that use space sublinear in $n$. Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve the state of the art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.