A case where a spindly two-layer linear network whips any neural network
with a fully connected input layer
- URL: http://arxiv.org/abs/2010.08625v1
- Date: Fri, 16 Oct 2020 20:49:58 GMT
- Title: A case where a spindly two-layer linear network whips any neural network
with a fully connected input layer
- Authors: Manfred K. Warmuth, Wojciech Kotłowski, Ehsan Amid
- Abstract summary: We show that a sparse input layer is needed to sample-efficiently learn sparse targets with gradient descent.
Surprisingly, the same type of problem can be solved drastically more efficiently by a simple 2-layer linear neural network.
- Score: 24.132345589750592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It was conjectured that any neural network of any structure, with arbitrary
differentiable transfer functions at the nodes, cannot learn the following
problem sample-efficiently when trained with gradient descent: the instances
are the rows of a $d$-dimensional Hadamard matrix and the target is one of the
features, i.e. very sparse. We essentially prove this conjecture: We show that
after receiving a random training set of size $k < d$, the expected square loss
is still $1-\frac{k}{d-1}$. The only requirement is that the input
layer is fully connected and the initial weight vectors of the input nodes are
chosen from a rotation-invariant distribution.
Surprisingly, the same type of problem can be solved drastically more
efficiently by a simple 2-layer linear neural network in which the $d$ inputs are
connected to the output node by chains of length 2 (now the input layer has
only one edge per input). When such a network is trained by gradient descent,
it has been shown that its expected square loss is $\frac{\log d}{k}$.
Our lower bounds essentially show that a sparse input layer is needed to
sample-efficiently learn sparse targets with gradient descent when the number
of examples is less than the number of input features.
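To make the Hadamard setup above concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper): a plain linear model stands in for the simplest network with a fully connected input layer, initialized from a rotation-invariant Gaussian distribution, and is compared against the spindly parameterization in which each input reaches the output through a chain of length 2, so the effective weight is the product $u_i v_i$. The dimensions, learning rates, and iteration counts are arbitrary choices made for this illustration.

```python
# Illustrative sketch of the Hadamard problem from the abstract (not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def hadamard(d):
    # Sylvester construction; d must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

d, k = 128, 32                        # k < d training examples
H = hadamard(d)
feature = 3                           # the sparse target: a single column of H
perm = rng.permutation(d)
train, test = perm[:k], perm[k:]
X_tr, y_tr = H[train], H[train, feature]
X_te, y_te = H[test],  H[test,  feature]

def dense_gd(X, y, steps=5000, lr=0.02):
    # Plain linear model (a stand-in for a fully connected input layer),
    # initialized from a rotation-invariant Gaussian distribution.
    w = rng.normal(scale=0.01, size=X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def spindly_gd(X, y, steps=5000, lr=0.02, init=1e-3):
    # "Spindly" 2-layer linear net: each input reaches the output through a
    # chain of length 2, so the effective weight is w = u * v.  The symmetric
    # positive init keeps the demo simple; it suffices because the target
    # weight here is +1.
    u = np.full(X.shape[1], init)
    v = np.full(X.shape[1], init)
    for _ in range(steps):
        r = X @ (u * v) - y           # residual on the training set
        gu = (X.T @ r) * v / len(y)   # chain rule through w = u * v
        gv = (X.T @ r) * u / len(y)
        u -= lr * gu
        v -= lr * gv
    return u * v

def sq_loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Expected: the dense model's test loss stays large, while the spindly
# parameterization drives it close to zero.
print("dense   test loss:", sq_loss(dense_gd(X_tr, y_tr), X_te, y_te))
print("spindly test loss:", sq_loss(spindly_gd(X_tr, y_tr), X_te, y_te))
```

Running the sketch should show the dense model's test loss remaining large (the abstract lower-bounds the expected loss by $1-\frac{k}{d-1}$ for $k<d$ examples), while the spindly parameterization drives it close to zero, consistent with the $\frac{\log d}{k}$ behaviour.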
Related papers
- Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - SGD Finds then Tunes Features in Two-Layer Neural Networks with
near-Optimal Sample Complexity: A Case Study in the XOR problem [1.3597551064547502]
We consider the optimization process of minibatch stochastic gradient descent (SGD) on a 2-layer neural network with data separated by a quadratic ground truth function.
We prove that with data drawn from the $d$-dimensional Boolean hypercube labeled by the quadratic "XOR" function $y = -x_i x_j$, it is possible to train to a population error $o(1)$ with $d \cdot \mathrm{polylog}(d)$ samples (a toy sketch of this setup appears after the related-papers list below).
arXiv Detail & Related papers (2023-09-26T17:57:44Z) - Beyond NTK with Vanilla Gradient Descent: A Mean-Field Analysis of
Neural Networks with Polynomial Width, Samples, and Time [37.73689342377357]
It is still an open question whether gradient descent on networks without unnatural modifications can achieve better sample complexity than kernel methods.
We show that projected gradient descent with a positive learning rate converges to low error with polynomially many samples.
arXiv Detail & Related papers (2023-06-28T16:45:38Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove that gradient descent minimizes the training loss using $m=\Omega(\log^4 n)$ neurons and $T \approx n$ iterations.
With $m=\Omega(\log^4 n)$ neurons and $T \approx n$ iterations, we bound the test loss by $\tilde{O}(1/T)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result holds already for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Algorithms for Efficiently Learning Low-Rank Neural Networks [12.916132936159713]
We study algorithms for learning low-rank neural networks.
We present a provably efficient algorithm which learns an optimal low-rank approximation to a single-hidden-layer ReLU network.
We propose a novel low-rank framework for training low-rank deep networks.
arXiv Detail & Related papers (2022-02-02T01:08:29Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn the target with gradient descent, going beyond the Neural Tangent Kernel (NTK) regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - Towards Understanding Hierarchical Learning: Benefits of Neural
Representations [160.33479656108926]
In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks.
We show that neural representations can achieve improved sample complexity compared with learning from the raw input.
Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
arXiv Detail & Related papers (2020-06-24T02:44:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.