Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks
- URL: http://arxiv.org/abs/2103.01887v1
- Date: Tue, 2 Mar 2021 17:36:03 GMT
- Title: Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks
- Authors: David Gamarnik, Eren C. Kızıldağ, and Ilias Zadik
- Abstract summary: We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that fits a training data set as accurately as possible.
We then leverage our bounds to establish generalization guarantees for such networks through the fat-shattering dimension.
Notably, our generalization bounds also have good sample complexity (polynomials in $d$ with a low degree).
- Score: 16.64116123743938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of finding a two-layer neural network with sigmoid,
rectified linear unit (ReLU), or binary step activation functions that "fits" a
training data set as accurately as possible as quantified by the training
error; and study the following question: \emph{does a low training error
guarantee that the norm of the output layer (outer norm) itself is small?} We
answer this question affirmatively for the case of non-negative output weights.
Using a simple covering number argument, we establish that, under quite mild
distributional assumptions on the input/label pairs, any such network achieving
a small training error on polynomially many data points necessarily has a
well-controlled outer norm. Notably, our results (a) have a polynomial (in $d$)
sample complexity, (b) are independent of the number of hidden units (which can
potentially be very high), (c) are oblivious to the training algorithm; and (d)
require quite mild assumptions on the data (in particular the input vector
$X\in\mathbb{R}^d$ need not have independent coordinates). We then leverage our
bounds to establish generalization guarantees for such networks through
\emph{fat-shattering dimension}, a scale-sensitive measure of the complexity
class that the network architectures we investigate belong to. Notably, our
generalization bounds also have good sample complexity (polynomials in $d$ with
a low degree), and are in fact near-linear for some important cases of
interest.
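To make the abstract's quantities concrete, the following is a minimal sketch (not taken from the paper) of the setup it describes, assuming a standard parameterization $f(x)=\sum_{j} a_j\,\sigma(\langle w_j, x\rangle + b_j)$ with non-negative output weights $a_j \ge 0$; the ReLU activation, squared training error, and the $\ell_1$ choice of outer norm here are illustrative assumptions.

```python
# Illustrative sketch only: a two-layer network with non-negative output weights;
# the paper's exact loss and norm conventions may differ.
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 20, 500, 1000            # input dimension, hidden units, samples
X = rng.standard_normal((n, d))    # illustrative inputs (the paper allows dependent coordinates)
y = rng.uniform(0, 1, size=n)      # illustrative labels

W = rng.standard_normal((m, d))    # hidden-layer weights
b = rng.standard_normal(m)         # hidden-layer biases
a = rng.uniform(0, 1, size=m) / m  # non-negative output weights (a_j >= 0)

relu = lambda z: np.maximum(z, 0.0)

def network(X):
    """f(x) = sum_j a_j * ReLU(<w_j, x> + b_j)."""
    return relu(X @ W.T + b) @ a

train_error = np.mean((network(X) - y) ** 2)  # empirical (training) error
outer_norm = np.sum(np.abs(a))                # "outer norm": norm of the output layer

print(f"training error = {train_error:.4f}, outer norm = {outer_norm:.4f}")
```

Informally, the paper's result is a converse direction: under mild distributional assumptions and with polynomially many (in $d$) samples, a small training error forces the outer norm to be well controlled, independently of the number of hidden units and of the training algorithm that produced the weights.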
Related papers
- Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods [43.32546195968771]
We study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation.
Our results address shortcomings of well-established Rademacher complexity-based bounds.
We show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution.
arXiv Detail & Related papers (2024-10-13T21:49:29Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Sampling weights of deep neural networks [1.2370077627846041]
We introduce a probability distribution, combined with an efficient sampling algorithm, for weights and biases of fully-connected neural networks.
In a supervised learning context, no iterative optimization or gradient computations of internal network parameters are needed.
We prove that sampled networks are universal approximators.
arXiv Detail & Related papers (2023-06-29T10:13:36Z) - Joint Edge-Model Sparse Learning is Provably Efficient for Graph Neural
- Joint Edge-Model Sparse Learning is Provably Efficient for Graph Neural Networks [89.28881869440433]
This paper provides the first theoretical characterization of joint edge-model sparse learning for graph neural networks (GNNs).
It proves analytically that both sampling important nodes and pruning the lowest-magnitude neurons can reduce the sample complexity and improve convergence without compromising test accuracy.
arXiv Detail & Related papers (2023-02-06T16:54:20Z) - Is Stochastic Gradient Descent Near Optimal? [0.0]
We show that stochastic gradient descent achieves small expected error with a bounded number of samples and total number of queries.
This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (arXiv:2203.00246) in a computationally efficient manner.
arXiv Detail & Related papers (2022-09-18T18:26:43Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild overparameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU network with standard Gaussian weights and uniformly distributed biases can make two well-separated classes linearly separable with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Achieving Small Test Error in Mildly Overparameterized Neural Networks [30.664282759625948]
We give an algorithm that finds one such network (achieving small test error) in polynomial time.
In addition, we prove that for a fully connected neural net, under an additional assumption on the data distribution, there is a polynomial-time algorithm.
arXiv Detail & Related papers (2021-04-24T06:47:20Z) - Neural Networks are Convex Regularizers: Exact Polynomial-time Convex
Optimization Formulations for Two-layer Networks [70.15611146583068]
We develop exact convex optimization representations of training two-layer neural networks with rectified linear units (ReLUs).
Our theory utilizes semi-infinite duality and minimum norm regularization.
arXiv Detail & Related papers (2020-02-24T21:32:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.