Towards Understanding the Condensation of Two-layer Neural Networks at
Initial Training
- URL: http://arxiv.org/abs/2105.11686v1
- Date: Tue, 25 May 2021 05:47:55 GMT
- Title: Towards Understanding the Condensation of Two-layer Neural Networks at
Initial Training
- Authors: Zhi-Qin John Xu, Hanxu Zhou, Tao Luo, Yaoyu Zhang
- Abstract summary: We show that the singularity of the activation function at the origin is a key factor in understanding condensation at the initial training stage.
Our experiments suggest that the maximal number of condensed orientations is twice the order of the singularity.
- Score: 1.1958610985612828
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: It is important to study what implicit regularization is imposed
on the loss function during training that leads over-parameterized neural
networks (NNs) to good performance on real datasets. Empirically, existing
works have shown that the weights of NNs condense on isolated orientations
under small initialization. The condensation implies that the NN learns
features from the training data and is effectively a much smaller network.
In this work, we show that the singularity of the activation function at the
origin is a key factor in understanding condensation at the initial training
stage. Our experiments suggest that the maximal number of condensed
orientations is twice the order of the singularity. Our theoretical analysis
confirms the experiments in two cases: one for activation functions with
first-order singularity and the other for one-dimensional input. This work
takes a step towards understanding how small initialization implicitly leads
NNs to condensation at initial training, which is crucial for understanding
the training and learning of deep NNs.
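For intuition only, the following NumPy sketch illustrates the phenomenon described in the abstract above; it is not the authors' code or experimental setup. A two-layer ReLU network is trained from a very small initialization on a synthetic regression task, and the hidden-weight vectors are then greedily clustered by cosine similarity to count how many distinct orientations they occupy. Since ReLU has a first-order singularity at the origin, the abstract's claim would suggest at most two condensed orientations; the teacher function, initialization scale, learning rate, and clustering threshold are all illustrative assumptions and may need tuning for the effect to be visible.

```python
# A minimal sketch (not the authors' code): train a two-layer ReLU network
# f(x) = sum_k a_k * relu(w_k . x) from a very small initialization and count
# how many distinct orientations the hidden weights w_k occupy afterwards.
# All hyperparameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 2, 50, 80                       # input dim, hidden width, samples
X = rng.normal(size=(n, d))
y = np.maximum(X @ np.array([1.0, -0.5]), 0.0)  # simple ReLU teacher (assumed)

init_scale = 1e-3                         # "small initialization"; exact scaling is a guess
W = rng.normal(size=(m, d)) * init_scale  # hidden-layer weights, shape (m, d)
a = rng.normal(size=m) * init_scale       # output-layer weights, shape (m,)

lr, steps = 0.1, 20000                    # may need tuning for the effect to show
for _ in range(steps):
    z = X @ W.T                           # pre-activations, shape (n, m)
    h = np.maximum(z, 0.0)                # ReLU features
    err = h @ a - y                       # residuals, shape (n,)
    grad_a = h.T @ err / n
    grad_W = ((err[:, None] * (z > 0.0)) * a).T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

# Count distinct orientations by greedily clustering the unit vectors
# w_k / |w_k| whose pairwise cosine similarity exceeds a threshold.
U = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
remaining, centers = list(range(m)), []
while remaining:
    k = remaining.pop(0)
    centers.append(k)
    remaining = [j for j in remaining if U[j] @ U[k] < 0.99]

# Per the abstract's claim, ReLU (first-order singularity) should yield at
# most two condensed orientations once condensation sets in.
print(f"hidden units: {m}, distinct orientations: {len(centers)}")
```

The greedy cosine clustering is only a crude proxy for the condensation measures used in the literature; plotting a histogram of the angles of w_k / |w_k| gives a more direct picture of how the orientations cluster.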
Related papers
- Learning a Neuron by a Shallow ReLU Network: Dynamics and Implicit Bias
for Correlated Inputs [5.7166378791349315]
We prove that, for the fundamental regression task of learning a single neuron, training a one-hidden-layer ReLU network converges to zero loss.
We also show and characterise a surprising distinction in this setting between interpolator networks of minimal rank and those of minimal Euclidean norm.
arXiv Detail & Related papers (2023-06-10T16:36:22Z) - Understanding the Initial Condensation of Convolutional Neural Networks [6.451914896767135]
We show that the kernels of two-layer convolutional neural networks converge to one or a few directions during training.
This work represents a step towards a better understanding of the non-linear training behavior exhibited by neural networks with specialized structures.
arXiv Detail & Related papers (2023-05-17T05:00:47Z) - Phase Diagram of Initial Condensation for Two-layer Neural Networks [4.404198015660192]
We present a phase diagram of initial condensation for two-layer neural networks.
Our phase diagram serves to provide a comprehensive understanding of the dynamical regimes of neural networks.
arXiv Detail & Related papers (2023-03-12T03:55:38Z) - Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that the neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics, and exploit higher-order statistics only later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Limitations of the NTK for Understanding Generalization in Deep Learning [13.44676002603497]
We study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization.
We show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling.
arXiv Detail & Related papers (2022-06-20T21:23:28Z) - How does unlabeled data improve generalization in self-training? A
one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis of the iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments on both shallow and deep neural networks are also provided to corroborate our theoretical insights on self-training.
arXiv Detail & Related papers (2022-01-21T02:16:52Z) - What training reveals about neural network complexity [80.87515604428346]
This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced by how fast its weights change during training.
Our results support the hypothesis that good training behavior can be a useful bias towards good generalization.
arXiv Detail & Related papers (2021-06-08T08:58:00Z) - Activation function design for deep networks: linearity and effective
initialisation [10.108857371774977]
We study how to avoid two problems at initialisation identified in prior works.
We prove that both these problems can be avoided by choosing an activation function possessing a sufficiently large linear region around the origin.
arXiv Detail & Related papers (2021-05-17T11:30:46Z) - Feature Purification: How Adversarial Training Performs Robust Deep
Learning [66.05472746340142]
We present a principle we call Feature Purification: one cause of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training of a neural network.
We present both experiments on the CIFAR-10 dataset that illustrate this principle and a theoretical result proving that, for certain natural classification tasks, training a two-layer ReLU network with randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z) - Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)