Understanding the Initial Condensation of Convolutional Neural Networks
- URL: http://arxiv.org/abs/2305.09947v1
- Date: Wed, 17 May 2023 05:00:47 GMT
- Title: Understanding the Initial Condensation of Convolutional Neural Networks
- Authors: Zhangchen Zhou, Hanxu Zhou, Yuqing Li, Zhi-Qin John Xu
- Abstract summary: Kernels of two-layer convolutional neural networks converge to one or a few directions during training.
This work represents a step towards a better understanding of the non-linear training behavior exhibited by neural networks with specialized structures.
- Score: 6.451914896767135
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Previous research has shown that fully-connected networks with small
initialization and gradient-based training methods exhibit a phenomenon known
as condensation during training. This phenomenon refers to the input weights of
hidden neurons condensing into isolated orientations during training, revealing
an implicit bias towards simple solutions in the parameter space. However, the
impact of neural network structure on condensation has not been investigated
yet. In this study, we focus on the investigation of convolutional neural
networks (CNNs). Our experiments suggest that when subjected to small
initialization and gradient-based training methods, kernel weights within the
same CNN layer also cluster together during training, demonstrating a
significant degree of condensation. Theoretically, we demonstrate that in a
finite training period, kernels of a two-layer CNN with small initialization
will converge to one or a few directions. This work represents a step towards a
better understanding of the non-linear training behavior exhibited by neural
networks with specialized structures.
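The condensation phenomenon described above can be probed empirically. Below is a minimal, hypothetical sketch (not the authors' code) that trains a small two-layer CNN from a small initialization on synthetic data and tracks the mean absolute pairwise cosine similarity between the flattened kernels of the convolutional layer; values approaching 1 indicate that the kernels have condensed onto one direction (or its opposite). The model, synthetic data, tanh activation, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Hypothetical illustration (not the authors' code): track condensation of
# convolutional kernels during training with a small initialization.
import torch
import torch.nn as nn

torch.manual_seed(0)


class TwoLayerCNN(nn.Module):
    """A small two-layer CNN: one conv layer + global average pooling + linear readout."""

    def __init__(self, n_kernels=16, init_scale=1e-3):
        super().__init__()
        self.conv = nn.Conv2d(1, n_kernels, kernel_size=5, bias=False)
        self.fc = nn.Linear(n_kernels, 1, bias=False)
        with torch.no_grad():            # small initialization: shrink default weights
            self.conv.weight.mul_(init_scale)
            self.fc.weight.mul_(init_scale)

    def forward(self, x):
        h = torch.tanh(self.conv(x))     # (batch, n_kernels, H', W')
        return self.fc(h.mean(dim=(2, 3)))


def mean_abs_cosine(conv):
    """Mean absolute pairwise cosine similarity between flattened kernels."""
    w = conv.weight.detach().flatten(1)
    w = w / (w.norm(dim=1, keepdim=True) + 1e-12)
    cos = w @ w.t()
    off_diag = cos[~torch.eye(len(w), dtype=torch.bool)]
    return off_diag.abs().mean().item()


# Synthetic regression data; the target is an arbitrary smooth function of the input.
x = torch.randn(256, 1, 16, 16)
y = torch.tanh(x.mean(dim=(1, 2, 3))).unsqueeze(1)

model = TwoLayerCNN()
opt = torch.optim.SGD(model.parameters(), lr=0.5)   # learning rate and step count are illustrative
loss_fn = nn.MSELoss()

for step in range(2001):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:5d}  loss {loss.item():.4e}  "
              f"mean |cos| between kernels {mean_abs_cosine(model.conv):.3f}")
```

Rerunning the same sketch with a larger initialization scale (for example, init_scale=1.0) would be expected to keep the mean absolute cosine similarity well below 1, consistent with condensation being specific to the small-initialization regime.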
Related papers
- Early Directional Convergence in Deep Homogeneous Neural Networks for
Small Initializations [2.310288676109785]
This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks from small initializations.
The weights of the neural network remain small in norm and approximately converge in direction towards Karush-Kuhn-Tucker (KKT) points.
arXiv Detail & Related papers (2024-03-12T23:17:32Z) - On the dynamics of three-layer neural networks: initial condensation [2.022855152231054]
Condensation occurs when gradient methods spontaneously reduce the complexity of neural networks.
We establish the blow-up property of effective dynamics and present a sufficient condition for the occurrence of condensation.
We also explore the association between condensation and the low-rank bias observed in deep matrix factorization.
arXiv Detail & Related papers (2024-02-25T02:36:14Z) - Learning a Neuron by a Shallow ReLU Network: Dynamics and Implicit Bias
for Correlated Inputs [5.7166378791349315]
We prove that, for the fundamental regression task of learning a single neuron, training a one-hidden layer ReLU network converges to zero loss.
We also show and characterise a surprising distinction in this setting between interpolator networks of minimal rank and those of minimal Euclidean norm.
arXiv Detail & Related papers (2023-06-10T16:36:22Z) - Phase Diagram of Initial Condensation for Two-layer Neural Networks [4.404198015660192]
We present a phase diagram of initial condensation for two-layer neural networks.
Our phase diagram serves to provide a comprehensive understanding of the dynamical regimes of neural networks.
arXiv Detail & Related papers (2023-03-12T03:55:38Z) - Gradient Descent in Neural Networks as Sequential Learning in RKBS [63.011641517977644]
We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights.
We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in a reproducing kernel Banach space (RKBS).
arXiv Detail & Related papers (2023-02-01T03:18:07Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics.
They exploit higher-order statistics only later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - What can linearized neural networks actually say about generalization? [67.83999394554621]
In certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization.
We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks.
Our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research.
arXiv Detail & Related papers (2021-06-12T13:05:11Z) - Towards Understanding the Condensation of Two-layer Neural Networks at
Initial Training [1.1958610985612828]
We show that the singularity of the activation function at the origin is a key factor in understanding condensation at the initial training stage.
Our experiments suggest that the maximal number of condensed orientations is twice the singularity order.
arXiv Detail & Related papers (2021-05-25T05:47:55Z) - Feature Purification: How Adversarial Training Performs Robust Deep
Learning [66.05472746340142]
We present a principle we call Feature Purification: one of the causes of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training of a neural network.
We present both experiments on the CIFAR-10 dataset illustrating this principle and a theoretical result proving that, for certain natural classification tasks, training a two-layer ReLU network with randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z) - A Generalized Neural Tangent Kernel Analysis for Two-layer Neural
Networks [87.23360438947114]
We show that noisy gradient descent with weight decay can still exhibit "kernel-like" behavior.
This implies that the training loss converges linearly up to a certain accuracy.
We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.
arXiv Detail & Related papers (2020-02-10T18:56:15Z)