Should Under-parameterized Student Networks Copy or Average Teacher
Weights?
- URL: http://arxiv.org/abs/2311.01644v2
- Date: Tue, 16 Jan 2024 00:21:43 GMT
- Title: Should Under-parameterized Student Networks Copy or Average Teacher
Weights?
- Authors: Berfin Şimşek, Amire Bendjeddou, Wulfram Gerstner, Johanni Brea
- Abstract summary: We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons.
As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons.
We find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron.
- Score: 7.777410338143785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Any continuous function $f^*$ can be approximated arbitrarily well by a
neural network with sufficiently many neurons $k$. We consider the case when
$f^*$ itself is a neural network with one hidden layer and $k$ neurons.
Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen
as fitting an under-parameterized "student" network with $n$ neurons to a
"teacher" network with $k$ neurons. As the student has fewer neurons than the
teacher, it is unclear whether each of the $n$ student neurons should copy one
of the teacher neurons or rather average a group of teacher neurons. For
shallow neural networks with erf activation function and for the standard
Gaussian input distribution, we prove that "copy-average" configurations are
critical points if the teacher's incoming vectors are orthonormal and its
outgoing weights are unitary. Moreover, the optimum among such configurations
is reached when $n-1$ student neurons each copy one teacher neuron and the
$n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the
student network with $n=1$ neuron, we additionally provide a closed-form
solution of the non-trivial critical point(s) for commonly used activation
functions through solving an equivalent constrained optimization problem.
Empirically, we find for the erf activation function that gradient flow
converges either to the optimal copy-average critical point or to another point
where each student neuron approximately copies a different teacher neuron.
Finally, we find similar results for the ReLU activation function, suggesting
that the optimal solution of under-parameterized networks has a universal
structure.
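As a concrete illustration of the setup above, the sketch below builds a shallow erf teacher with orthonormal incoming vectors and unit outgoing weights, then compares a pure-copy student against a simple copy-average student under standard Gaussian inputs via a Monte Carlo estimate of the population loss. The dimensions and the particular scaling of the averaging neuron (unit-norm mean direction, summed outgoing weight) are illustrative assumptions for this sketch, not the optimal parameterization derived in the paper.

```python
# Minimal sketch (not the paper's code): compare a "pure copy" student with a
# simple "copy-average" student on an erf teacher whose incoming vectors are
# orthonormal and whose outgoing weights are all 1. The averaging neuron's
# scaling below is an illustrative assumption, not the paper's optimum.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, k, n = 16, 8, 5          # input dim, teacher width, student width (n < k)

# Teacher: rows of W are orthonormal incoming vectors; outgoing weights all 1.
W = np.linalg.qr(rng.standard_normal((d, k)))[0].T    # shape (k, d)
v = np.ones(k)

def net(x, incoming, outgoing):
    """Shallow erf network: sum_j outgoing[j] * erf(incoming[j] . x)."""
    return erf(x @ incoming.T) @ outgoing

# Pure-copy student: n neurons copy n distinct teacher neurons.
U_copy, a_copy = W[:n].copy(), v[:n].copy()

# Copy-average student: n-1 neurons copy, the last one "averages" the remaining
# k-n+1 teacher neurons (mean direction, summed outgoing weight -- an assumption).
rest = W[n - 1:]                              # the k-n+1 remaining teacher neurons
avg_dir = rest.mean(axis=0)
avg_dir /= np.linalg.norm(avg_dir)
U_ca = np.vstack([W[:n - 1], avg_dir])
a_ca = np.append(v[:n - 1], v[n - 1:].sum())

# Monte Carlo estimate of the population loss E_x[(f*(x) - f(x))^2], x ~ N(0, I_d).
X = rng.standard_normal((200_000, d))
teacher_out = net(X, W, v)
for name, U, a in [("pure copy", U_copy, a_copy), ("copy-average", U_ca, a_ca)]:
    loss = np.mean((teacher_out - net(X, U, a)) ** 2)
    print(f"{name:>12s} loss ≈ {loss:.4f}")
```

For erf units with Gaussian inputs the relevant neuron-neuron overlaps admit closed-form (arcsine-type) expressions, so the population loss could be evaluated exactly; the Monte Carlo estimate above simply keeps the sketch short.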
Related papers
- Optimal Neural Network Approximation for High-Dimensional Continuous Functions [5.748690310135373]
We present a family of continuous functions that requires at least width $d$, and therefore at least $d$ intrinsic neurons, to achieve arbitrary accuracy in its approximation.
This shows that the requirement of $\mathcal{O}(d)$ intrinsic neurons is optimal in the sense that it grows linearly with the input dimension $d$.
arXiv Detail & Related papers (2024-09-04T01:18:55Z) - On the High Symmetry of Neural Network Functions [0.0]
Training neural networks means solving a high-dimensional optimization problem.
This paper shows how, due to the way neural networks are designed, the neural network function presents a very large symmetry in the parameter space.
arXiv Detail & Related papers (2022-11-12T07:51:14Z) - Normalization effects on deep neural networks [20.48472873675696]
We study the effect of the choice of the $\gamma_i$ on the statistical behavior of the neural network's output.
We find that, in terms of the variance of the neural network's output and test accuracy, the best choice is to set all $\gamma_i$ equal to one.
arXiv Detail & Related papers (2022-09-02T17:05:55Z) - Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student
Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms any kernel method.
arXiv Detail & Related papers (2022-05-30T02:51:36Z) - Neural Capacitance: A New Perspective of Neural Network Selection via
Edge Dynamics [85.31710759801705]
Current practice requires expensive computational costs in model training for performance prediction.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - A Local Convergence Theory for Mildly Over-Parameterized Two-Layer
Neural Network [39.341620528427306]
We develop a local convergence theory for mildly over-parameterized neural networks.
We show that as long as the loss is already lower than a threshold, every student neuron converges to one of the teacher neurons.
Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons.
arXiv Detail & Related papers (2021-02-04T04:41:04Z) - Towards Understanding Hierarchical Learning: Benefits of Neural
Representations [160.33479656108926]
In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks.
We show that neural representation can achieve improved sample complexities compared with the raw input.
Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
arXiv Detail & Related papers (2020-06-24T02:44:54Z) - Network size and weights size for memorization with two-layers neural
networks [15.333300054767726]
We propose a new training procedure for ReLU networks, based on complex (as opposed to real) recombination of the neurons.
We show approximate memorization with $O\!\left(\frac{n}{d} \cdot \frac{\log(1/\epsilon)}{\epsilon}\right)$ neurons, as well as nearly-optimal size of the weights.
arXiv Detail & Related papers (2020-06-04T13:44:57Z) - Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)