Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis
- URL: http://arxiv.org/abs/2206.12802v1
- Date: Sun, 26 Jun 2022 06:51:31 GMT
- Title: Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis
- Authors: Alexander Munteanu, Simon Omlor, Zhao Song, David P. Woodruff
- Abstract summary: We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work and that, under certain assumptions, are best possible.
- Score: 121.9821494461427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common method in training neural networks is to initialize all the weights
to be independent Gaussian vectors. We observe that by instead initializing the
weights into independent pairs, where each pair consists of two identical
Gaussian vectors, we can significantly improve the convergence analysis. While
a similar technique has been studied for random inputs [Daniely, NeurIPS 2020],
it has not been analyzed with arbitrary inputs. Using this technique, we show
how to significantly reduce the number of neurons required for two-layer ReLU
networks, both in the under-parameterized setting with logistic loss, from
roughly $\gamma^{-8}$ [Ji and Telgarsky, ICLR 2020] to $\gamma^{-2}$, where
$\gamma$ denotes the separation margin with a Neural Tangent Kernel, as well as
in the over-parameterized setting with squared loss, from roughly $n^4$ [Song
and Yang, 2019] to $n^2$, implicitly also improving the recent running time
bound of [Brand, Peng, Song and Weinstein, ITCS 2021]. For the
under-parameterized setting we also prove new lower bounds that improve upon
prior work and that, under certain assumptions, are best possible.
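To make the pairing concrete, here is a minimal NumPy sketch of a coupled initialization for a two-layer ReLU network. The $1/\sqrt{m}$ output scaling and the choice of opposite output-layer signs within each pair (which makes the network identically zero at initialization) are illustrative assumptions; the abstract above only specifies that the first-layer weights are drawn in identical Gaussian pairs.

```python
# A minimal sketch of coupled initialization for a two-layer ReLU network.
# Assumptions not stated in the abstract: opposite output-layer signs within
# each pair (so the network is the zero function at initialization) and a
# 1/sqrt(m) output scaling.
import numpy as np


def coupled_init(m, d, rng=None):
    """First-layer weights W (m x d) in identical Gaussian pairs, output signs a (m,)."""
    assert m % 2 == 0, "coupled initialization uses an even number of neurons"
    rng = np.random.default_rng(rng)
    half = rng.standard_normal((m // 2, d))  # m/2 independent Gaussian vectors
    W = np.repeat(half, 2, axis=0)           # duplicate each vector -> identical pairs
    a = np.tile([1.0, -1.0], m // 2)         # opposite signs within each pair (assumed)
    return W, a


def two_layer_relu(x, W, a):
    """f(x) = (1/sqrt(m)) * sum_r a_r * relu(<w_r, x>)."""
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(W.shape[0])


rng = np.random.default_rng(0)
W, a = coupled_init(m=8, d=5, rng=rng)
x = rng.standard_normal(5)
print(two_layer_relu(x, W, a))  # 0.0: identical pairs with opposite signs cancel
```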
Related papers
- Enhanced Feature Learning via Regularisation: Integrating Neural Networks and Kernel Methods [0.0]
We introduce an efficient method for the estimator, called the Brownian Kernel Neural Network (BKerNN).
We show that BKerNN's expected risk converges to the minimal risk with explicit high-probability rates of $O(\min((d/n)^{1/2}, n^{-1/6}))$ (up to logarithmic factors).
arXiv Detail & Related papers (2024-07-24T13:46:50Z) - Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression [8.130817534654089]
In this paper, we consider nonparametric regression by a two-layer neural network trained by gradient descent (GD) or its variants.
We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has the spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$.
arXiv Detail & Related papers (2024-07-16T03:38:34Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Matching the Statistical Query Lower Bound for k-sparse Parity Problems with Stochastic Gradient Descent [83.85536329832722]
We show that stochastic gradient descent (SGD) can efficiently solve the $k$-parity problem on the $d$-dimensional hypercube.
We then demonstrate that a neural network trained with SGD solves the $k$-parity problem with small statistical error; a minimal sketch of the $k$-parity problem follows below.
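For reference, the sketch below spells out the $k$-sparse parity problem: inputs are uniform over the $d$-dimensional hypercube $\{-1,+1\}^d$ and the label is the product of the coordinates in a hidden size-$k$ subset $S$. The concrete choice of $S$ and the sampling code are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the k-sparse parity problem: y = prod_{j in S} x_j for x in {-1,+1}^d,
# where S is a hidden subset of size k.  The subset below is an arbitrary example.
import numpy as np


def sample_k_sparse_parity(n, d, S, rng=None):
    """Draw n labeled examples (X, y) with y_i = prod_{j in S} X_ij."""
    rng = np.random.default_rng(rng)
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = np.prod(X[:, S], axis=1)
    return X, y


X, y = sample_k_sparse_parity(n=4, d=10, S=[0, 3, 7], rng=0)  # k = 3
print(X.shape, y)  # (4, 10), labels in {-1, +1}
```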
arXiv Detail & Related papers (2024-04-18T17:57:53Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove that the training loss is minimized with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/\cdot)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Training Overparametrized Neural Networks in Sublinear Time [14.918404733024332]
Deep learning comes at a tremendous computational and energy cost.
We present a new view of a subset of binary neural networks as a small subset of search trees, where each corresponds to a subset of search trees.
We believe this view will have further applications in the analysis of deep networks.
arXiv Detail & Related papers (2022-08-09T02:29:42Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features; a minimal sketch of such a single first-layer step follows below.
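The sketch below performs one gradient descent step on the first-layer weights $\boldsymbol{W}$ of a two-layer network. The ReLU activation, squared loss, target function, and step size are illustrative assumptions and may differ from the paper's exact setup.

```python
# One gradient descent step on the first-layer weights W of a two-layer network
# f(x) = (1/sqrt(m)) * a^T relu(W x), with squared loss.  Activation, loss,
# target, and step size are assumptions made for this sketch.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, eta = 200, 20, 64, 1.0

X = rng.standard_normal((n, d))           # inputs
y = np.tanh(X[:, 0])                      # an arbitrary target, for illustration
W = rng.standard_normal((m, d))           # first-layer weights (the trained parameters)
a = rng.choice([-1.0, 1.0], size=m)       # second-layer weights (held fixed here)


def forward(W):
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)  # predictions, shape (n,)


# Gradient of the loss (1/2n) * sum_i (f(x_i) - y_i)^2 with respect to W.
residual = forward(W) - y                              # (n,)
active = (X @ W.T > 0).astype(float)                   # (n, m): ReLU derivative
grad_W = (active * residual[:, None] * a).T @ X / (n * np.sqrt(m))  # (m, d)

W_one_step = W - eta * grad_W                          # the single first-layer update
print(np.linalg.norm(W_one_step - W))                  # size of the one-step change
```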
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - An efficient projection neural network for $\ell_1$-regularized logistic
regression [10.517079029721257]
This paper presents a simple projection neural network for $\ell_1$-regularized logistic regression.
The proposed neural network does not require any extra auxiliary variable nor any smooth approximation.
We also investigate the convergence of the proposed neural network by using the Lyapunov theory and show that it converges to a solution of the problem with any arbitrary initial value.
arXiv Detail & Related papers (2021-05-12T06:13:44Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is... A minimal instantiation of the generative model follows below.
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
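As a small illustration of the generative model in the last entry, the sketch below draws data with $X$ standard Gaussian (one sub-Gaussian distribution with identity second moment, matching setting (ii)) and Gaussian noise. The noise distribution and the plain least-squares fit are assumptions for illustration only, not the paper's robust estimator.

```python
# Data drawn from Y = <X, w*> + eps.  Standard Gaussian X satisfies setting (ii)
# (sub-Gaussian, identity second moment); Gaussian noise and the ordinary
# least-squares fit below are illustrative assumptions, not the robust estimator.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
w_star = rng.standard_normal(d)

X = rng.standard_normal((n, d))       # identity second moment
eps = 0.1 * rng.standard_normal(n)    # assumed noise distribution
Y = X @ w_star + eps                  # Y = <X, w*> + eps

w_hat = np.linalg.lstsq(X, Y, rcond=None)[0]   # plain least squares, no robustness
print(np.linalg.norm(w_hat - w_star))          # small on this clean instance
```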