Should Under-parameterized Student Networks Copy or Average Teacher
Weights?
- URL: http://arxiv.org/abs/2311.01644v2
- Date: Tue, 16 Jan 2024 00:21:43 GMT
- Title: Should Under-parameterized Student Networks Copy or Average Teacher
Weights?
- Authors: Berfin Şimşek, Amire Bendjeddou, Wulfram Gerstner, Johanni Brea
- Abstract summary: We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons.
As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons.
We find for the erf activation function that gradient flow converges either to the optimal copy-average critical point or to another point where each student neuron approximately copies a different teacher neuron.
- Score: 7.777410338143785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Any continuous function $f^*$ can be approximated arbitrarily well by a
neural network with sufficiently many neurons $k$. We consider the case when
$f^*$ itself is a neural network with one hidden layer and $k$ neurons.
Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen
as fitting an under-parameterized "student" network with $n$ neurons to a
"teacher" network with $k$ neurons. As the student has fewer neurons than the
teacher, it is unclear whether each of the $n$ student neurons should copy one
of the teacher neurons or rather average a group of teacher neurons. For
shallow neural networks with erf activation function and for the standard
Gaussian input distribution, we prove that "copy-average" configurations are
critical points if the teacher's incoming vectors are orthonormal and its
outgoing weights are unitary. Moreover, the optimum among such configurations
is reached when $n-1$ student neurons each copy one teacher neuron and the
$n$-th student neuron averages the remaining $k-n+1$ teacher neurons. For the
student network with $n=1$ neuron, we additionally provide a closed-form
solution of the non-trivial critical point(s) for commonly used activation
functions through solving an equivalent constrained optimization problem.
Empirically, we find for the erf activation function that gradient flow
converges either to the optimal copy-average critical point or to another point
where each student neuron approximately copies a different teacher neuron.
Finally, we find similar results for the ReLU activation function, suggesting
that the optimal solution of under-parameterized networks has a universal
structure.
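As a concrete illustration of the setup above, the sketch below builds a shallow erf teacher with orthonormal incoming vectors and unit outgoing weights, then compares a pure-copy student against a simple copy-average student under standard Gaussian inputs via a Monte Carlo estimate of the population loss. The dimensions and the particular scaling of the averaging neuron (unit-norm mean direction, summed outgoing weight) are illustrative assumptions for this sketch, not the optimal parameterization derived in the paper.

```python
# Minimal sketch (not the paper's code): compare a "pure copy" student with a
# simple "copy-average" student on an erf teacher whose incoming vectors are
# orthonormal and whose outgoing weights are all 1. The averaging neuron's
# scaling below is an illustrative assumption, not the paper's optimum.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, k, n = 16, 8, 5          # input dim, teacher width, student width (n < k)

# Teacher: rows of W are orthonormal incoming vectors; outgoing weights all 1.
W = np.linalg.qr(rng.standard_normal((d, k)))[0].T    # shape (k, d)
v = np.ones(k)

def net(x, incoming, outgoing):
    """Shallow erf network: sum_j outgoing[j] * erf(incoming[j] . x)."""
    return erf(x @ incoming.T) @ outgoing

# Pure-copy student: n neurons copy n distinct teacher neurons.
U_copy, a_copy = W[:n].copy(), v[:n].copy()

# Copy-average student: n-1 neurons copy, the last one "averages" the remaining
# k-n+1 teacher neurons (mean direction, summed outgoing weight -- an assumption).
rest = W[n - 1:]                              # the k-n+1 remaining teacher neurons
avg_dir = rest.mean(axis=0)
avg_dir /= np.linalg.norm(avg_dir)
U_ca = np.vstack([W[:n - 1], avg_dir])
a_ca = np.append(v[:n - 1], v[n - 1:].sum())

# Monte Carlo estimate of the population loss E_x[(f*(x) - f(x))^2], x ~ N(0, I_d).
X = rng.standard_normal((200_000, d))
teacher_out = net(X, W, v)
for name, U, a in [("pure copy", U_copy, a_copy), ("copy-average", U_ca, a_ca)]:
    loss = np.mean((teacher_out - net(X, U, a)) ** 2)
    print(f"{name:>12s} loss ≈ {loss:.4f}")
```

For erf units with Gaussian inputs the relevant neuron-neuron overlaps admit closed-form (arcsine-type) expressions, so the population loss could be evaluated exactly; the Monte Carlo estimate above simply keeps the sketch short.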
Related papers
- Optimal Neural Network Approximation for High-Dimensional Continuous Functions [5.748690310135373]
We present a family of continuous functions that requires at least width $d$, and therefore at least $d$ intrinsic neurons, to achieve arbitrary accuracy in its approximation.
This shows that the requirement of $\mathcal{O}(d)$ intrinsic neurons is optimal in the sense that it grows linearly with the input dimension $d$.
arXiv Detail & Related papers (2024-09-04T01:18:55Z) - On the High Symmetry of Neural Network Functions [0.0]
Training neural networks means solving a high-dimensional optimization problem.
This paper shows how, due to the way neural networks are designed, the neural network function presents a very large symmetry in the parameter space.
arXiv Detail & Related papers (2022-11-12T07:51:14Z) - Normalization effects on deep neural networks [20.48472873675696]
We study the effect of the choice of the $\gamma_i$ on the statistical behavior of the neural network's output.
We find that, in terms of the variance of the neural network's output and test accuracy, the best choice is to set all $\gamma_i$ equal to one.
arXiv Detail & Related papers (2022-09-02T17:05:55Z) - Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student
Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms any kernel method.
arXiv Detail & Related papers (2022-05-30T02:51:36Z) - Neural Capacitance: A New Perspective of Neural Network Selection via
Edge Dynamics [85.31710759801705]
Current practice requires expensive computational costs in model training for performance prediction.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z) - The Separation Capacity of Random Neural Networks [78.25060223808936]
We show that a sufficiently large two-layer ReLU-network with standard Gaussian weights and uniformly distributed biases can solve this problem with high probability.
We quantify the relevant structure of the data in terms of a novel notion of mutual complexity.
arXiv Detail & Related papers (2021-07-31T10:25:26Z) - Locality defeats the curse of dimensionality in convolutional
teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z) - A Local Convergence Theory for Mildly Over-Parameterized Two-Layer
Neural Network [39.341620528427306]
We develop a local convergence theory for mildly over-parameterized neural networks.
We show that as long as the loss is already lower than a threshold, every student neuron converges to one of the teacher neurons.
Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons.
arXiv Detail & Related papers (2021-02-04T04:41:04Z) - Towards Understanding Hierarchical Learning: Benefits of Neural
Representations [160.33479656108926]
In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks.
We show that neural representation can achieve improved sample complexities compared with the raw input.
Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning.
arXiv Detail & Related papers (2020-06-24T02:44:54Z) - Network size and weights size for memorization with two-layers neural
networks [15.333300054767726]
We propose a new training procedure for ReLU networks, based on complex (as opposed to real) recombination of the neurons.
We show approximate memorization with $O\!\left(\frac{n}{d} \cdot \frac{\log(1/\epsilon)}{\epsilon}\right)$ neurons, as well as nearly-optimal size of the weights.
arXiv Detail & Related papers (2020-06-04T13:44:57Z) - Non-linear Neurons with Human-like Apical Dendrite Activations [81.18416067005538]
We show that a standard neuron followed by our novel apical dendrite activation (ADA) can learn the XOR logical function with 100% accuracy.
We conduct experiments on six benchmark data sets from computer vision, signal processing and natural language processing.
arXiv Detail & Related papers (2020-02-02T21:09:39Z)