A Local Convergence Theory for Mildly Over-Parameterized Two-Layer
Neural Network
- URL: http://arxiv.org/abs/2102.02410v1
- Date: Thu, 4 Feb 2021 04:41:04 GMT
- Title: A Local Convergence Theory for Mildly Over-Parameterized Two-Layer
Neural Network
- Authors: Mo Zhou, Rong Ge, Chi Jin
- Abstract summary: We develop a local convergence theory for mildly over-parameterized neural networks.
We show that as long as the loss is already lower than a threshold, all student neurons converge to one of the teacher neurons.
Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons.
- Score: 39.341620528427306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While over-parameterization is widely believed to be crucial for the success
of optimization for the neural networks, most existing theories on
over-parameterization do not fully explain the reason -- they either work in
the Neural Tangent Kernel regime where neurons don't move much, or require an
enormous number of neurons. In practice, when the data is generated using a
teacher neural network, even mildly over-parameterized neural networks can
achieve 0 loss and recover the directions of teacher neurons. In this paper we
develop a local convergence theory for mildly over-parameterized two-layer
neural network. We show that as long as the loss is already lower than a threshold
(polynomial in the relevant parameters), all student neurons in an
over-parameterized two-layer neural network will converge to one of the teacher
neurons, and the loss will go to 0. Our result holds for any number of student
neurons as long as it is at least as large as the number of teacher neurons,
and our convergence rate is independent of the number of student neurons. A key
component of our analysis is a new characterization of the local optimization
landscape: we show that the gradient satisfies a special case of the Łojasiewicz
property, which differs from the local strong convexity or PL conditions used
in previous work.
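For concreteness, here is a hedged sketch, in LaTeX, of the teacher-student setting and the kind of gradient inequality the abstract alludes to. The notation (teacher weights w_i^*, student weights w_j, activation sigma) and the generic exponent beta are illustrative assumptions; the paper's precise threshold and exponent are not stated in the abstract.

% Teacher-student setup (notation assumed for illustration): a teacher network
% with n neurons generates the labels; a student with m >= n neurons is trained
% on the squared loss.
\[
  f^{*}(x) = \sum_{i=1}^{n} a_{i}^{*}\,\sigma(\langle w_{i}^{*}, x \rangle),
  \qquad
  f_{\theta}(x) = \sum_{j=1}^{m} a_{j}\,\sigma(\langle w_{j}, x \rangle),
  \qquad m \ge n,
\]
\[
  L(\theta) = \tfrac{1}{2}\,\mathbb{E}_{x}\big[\big(f_{\theta}(x) - f^{*}(x)\big)^{2}\big].
\]
% The landscape characterization is a Lojasiewicz-type gradient lower bound:
% once L(\theta) is below a threshold polynomial in the relevant parameters,
\[
  \|\nabla L(\theta)\| \;\ge\; c\,L(\theta)^{\beta}
  \quad\text{for some } c > 0,\ \beta \in [\tfrac{1}{2}, 1).
\]
% The PL condition corresponds to \beta = 1/2 (up to constants); the paper's
% condition is a different special case of this family, which, combined with
% gradient descent, drives the loss to 0.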
Related papers
- Decorrelating neurons using persistence [29.25969187808722]
We present two regularisation terms computed from the weights of a minimum spanning tree of a clique built on the network's neurons, whose edge weights encode correlations between neurons.
We demonstrate that naive minimisation of all correlations between neurons obtains lower accuracies than our regularisation terms.
We include a proof of differentiability of our regularisers, thus developing the first effective topological persistence-based regularisation terms.
arXiv Detail & Related papers (2023-08-09T11:09:14Z)
- Spiking neural network for nonlinear regression [68.8204255655161]
Spiking neural networks carry the potential for a massive reduction in memory and energy consumption.
They introduce temporal and neuronal sparsity, which can be exploited by next-generation neuromorphic hardware.
A framework for regression using spiking neural networks is proposed.
arXiv Detail & Related papers (2022-10-06T13:04:45Z)
- Extrapolation and Spectral Bias of Neural Nets with Hadamard Product: a Polynomial Net Study [55.12108376616355]
The study of the NTK has been devoted to typical neural network architectures, but it is incomplete for neural networks with Hadamard products (NNs-Hp).
In this work, we derive the finite-width NTK formulation for a special class of NNs-Hp, i.e., polynomial neural networks.
We prove their equivalence to the kernel regression predictor with the associated NTK, which expands the application scope of NTK.
arXiv Detail & Related papers (2022-09-16T06:36:06Z)
- Consistency of Neural Networks with Regularization [0.0]
This paper proposes a general framework of neural networks with regularization and proves its consistency.
Two types of activation functions are taken into consideration: the hyperbolic tangent (Tanh) and the rectified linear unit (ReLU).
arXiv Detail & Related papers (2022-06-22T23:33:39Z)
- Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms any kernel method.
arXiv Detail & Related papers (2022-05-30T02:51:36Z)
- Optimal Learning Rates of Deep Convolutional Neural Networks: Additive Ridge Functions [19.762318115851617]
We consider the mean squared error analysis for deep convolutional neural networks.
We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal mini-max rates.
arXiv Detail & Related papers (2022-02-24T14:22:32Z)
- Improving Spiking Neural Network Accuracy Using Time-based Neurons [0.24366811507669117]
Research on neuromorphic computing systems based on low-power spiking neural networks using analog neurons is in the spotlight.
As technology scales down, analog neurons are difficult to scale, and they suffer from reduced voltage headroom/dynamic range and circuit nonlinearities.
This paper first models the nonlinear behavior of existing current-mirror-based voltage-domain neurons designed in a 28nm process, and shows that SNN inference accuracy can be severely degraded by the effect of the neurons' nonlinearity.
We propose a novel neuron, which processes incoming spikes in the time domain and greatly improves the linearity, thereby improving the inference accuracy compared to the existing voltage-domain neurons.
arXiv Detail & Related papers (2022-01-05T00:24:45Z)
- SeReNe: Sensitivity based Regularization of Neurons for Structured Sparsity in Neural Networks [13.60023740064471]
SeReNe is a method for learning sparse topologies with a structure.
We define the sensitivity of a neuron as the variation of the network output with respect to the variation of the neuron's activity.
By including the neuron sensitivity in the cost function as a regularization term, we are able to prune neurons with low sensitivity (a minimal illustrative sketch is given after this list).
arXiv Detail & Related papers (2021-02-07T10:53:30Z)
- Topological obstructions in neural networks learning [67.8848058842671]
We study global properties of the loss gradient function flow.
We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface.
arXiv Detail & Related papers (2020-12-31T18:53:25Z)
- Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)
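As referenced in the SeReNe entry above, here is a minimal, hedged sketch of sensitivity-based neuron regularization and pruning. It assumes a PyTorch-style setup; the sensitivity proxy (gradient of the summed output with respect to each hidden neuron's activity), the regularization strength lam, and the pruning threshold are illustrative assumptions, not the authors' implementation.

# Illustrative sketch of SeReNe-style sensitivity regularization; the sensitivity
# proxy, lam, and the pruning threshold below are assumptions, not from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)

hidden = {}
model[1].register_forward_hook(lambda mod, inp, out: hidden.update(h=out))

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
lam = 1e-3  # assumed regularization strength

for step in range(200):
    opt.zero_grad()
    pred = model(x)
    loss = nn.functional.mse_loss(pred, y)
    # Sensitivity proxy: |d(sum of outputs) / d(hidden activity)| per neuron,
    # averaged over the batch; create_graph=True makes the penalty trainable.
    grads = torch.autograd.grad(pred.sum(), hidden["h"], create_graph=True)[0]
    sensitivity = grads.abs().mean(dim=0)          # one value per hidden neuron
    (loss + lam * sensitivity.sum()).backward()
    opt.step()

# Prune neurons whose final sensitivity falls below an assumed threshold.
pred = model(x)
grads = torch.autograd.grad(pred.sum(), hidden["h"])[0]
mask = (grads.abs().mean(dim=0) > 1e-3).float()
with torch.no_grad():
    model[0].weight.mul_(mask.unsqueeze(1))        # zero the pruned neurons' rows
    model[0].bias.mul_(mask)
print(f"kept {int(mask.sum())} of {mask.numel()} hidden neurons")

This keeps only the core idea described in the summary: penalize low-sensitivity neurons via the cost function and prune them afterwards; the actual method's sensitivity aggregation and pruning schedule may differ.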
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.