Rapid training of deep neural networks without skip connections or
normalization layers using Deep Kernel Shaping
- URL: http://arxiv.org/abs/2110.01765v1
- Date: Tue, 5 Oct 2021 00:49:36 GMT
- Title: Rapid training of deep neural networks without skip connections or
normalization layers using Deep Kernel Shaping
- Authors: James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz,
Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz
- Abstract summary: We identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data.
We show how these can be avoided by carefully controlling the "shape" of the network's kernel function.
- Score: 46.083745557823164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using an extended and formalized version of the Q/C map analysis of Poole et
al. (2016), along with Neural Tangent Kernel theory, we identify the main
pathologies present in deep networks that prevent them from training fast and
generalizing to unseen data, and show how these can be avoided by carefully
controlling the "shape" of the network's initialization-time kernel function.
We then develop a method called Deep Kernel Shaping (DKS), which accomplishes
this using a combination of precise parameter initialization, activation
function transformations, and small architectural tweaks, all of which preserve
the model class. In our experiments we show that DKS enables SGD training of
residual networks without normalization layers on Imagenet and CIFAR-10
classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet
models, with only a small decrease in generalization performance. And when
using K-FAC as the optimizer, we achieve similar results for networks without
skip connections. Our results apply for a large variety of activation
functions, including those which traditionally perform very badly, such as the
logistic sigmoid. In addition to DKS, we contribute a detailed analysis of skip
connections, normalization layers, special activation functions like RELU and
SELU, and various initialization schemes, explaining their effectiveness as
alternative (and ultimately incomplete) ways of "shaping" the network's
initialization-time kernel.
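The Q/C map machinery mentioned in the abstract is easy to probe numerically. Below is a minimal Python sketch (not the paper's implementation): it builds a DKS-style transformed activation of the form γ·(φ(αx + β) + δ) and estimates the local Q (variance) and C (correlation) maps by Monte Carlo. The helper names and the constants α, β, γ, δ are illustrative placeholders, not the values DKS actually solves for.

```python
import numpy as np

def transformed_activation(phi, alpha, beta, gamma, delta):
    """DKS-style transformed activation: phi_hat(x) = gamma * (phi(alpha * x + beta) + delta).
    The constants here are placeholders; DKS solves for them so that the resulting
    Q/C maps are well behaved at initialization."""
    return lambda x: gamma * (phi(alpha * x + beta) + delta)

def q_map(phi, q, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the local Q (variance) map: q -> E_{x ~ N(0, q)}[phi(x)^2]."""
    x = np.random.default_rng(seed).normal(0.0, np.sqrt(q), size=n_samples)
    return float(np.mean(phi(x) ** 2))

def c_map(phi, c, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the local C (correlation) map at unit input variance:
    c -> E[phi(u) phi(v)] / E[phi(u)^2], with (u, v) jointly Gaussian with correlation c."""
    z1, z2 = np.random.default_rng(seed).normal(size=(2, n_samples))
    u, v = z1, c * z1 + np.sqrt(1.0 - c ** 2) * z2
    return float(np.mean(phi(u) * phi(v)) / np.mean(phi(u) ** 2))

if __name__ == "__main__":
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))   # a "traditionally bad" activation
    # Illustrative constants only: center the sigmoid and give it unit slope at 0.
    phi_hat = transformed_activation(sigmoid, alpha=1.0, beta=0.0, gamma=4.0, delta=-0.5)
    print("Q(1)   =", q_map(phi_hat, 1.0))
    print("C(0.5) =", c_map(phi_hat, 0.5))
```

Composing the estimated C map with itself many times is a quick way to visualize the depth-wise kernel degeneration that the abstract describes as a pathology.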
Related papers
- Fixing the NTK: From Neural Network Linearizations to Exact Convex
Programs [63.768739279562105]
We show that for a particular choice of mask weights that do not depend on the learning targets, this kernel is equivalent to the NTK of the gated ReLU network on the training data.
A consequence of this lack of dependence on the targets is that the NTK cannot perform better than the optimal MKL kernel on the training set.
arXiv Detail & Related papers (2023-09-26T17:42:52Z)
- Local Kernel Renormalization as a mechanism for feature learning in
overparametrized Convolutional Neural Networks [0.0]
Empirical evidence shows that fully-connected neural networks in the infinite-width limit eventually outperform their finite-width counterparts.
State-of-the-art architectures with convolutional layers achieve optimal performances in the finite-width regime.
We show that the generalization performance of a finite-width FC network can be obtained by an infinite-width network, with a suitable choice of the Gaussian priors.
arXiv Detail & Related papers (2023-07-21T17:22:04Z)
- Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK [86.45209429863858]
We study training one-hidden-layer ReLU networks in the neural tangent kernel (NTK) regime.
We show that these neural networks possess a different limiting kernel, which we call the bias-generalized NTK.
We also study various properties of the neural networks with this new kernel.
arXiv Detail & Related papers (2023-01-01T02:11:39Z)
- Simple initialization and parametrization of sinusoidal networks via
their kernel bandwidth [92.25666446274188]
Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions.
We first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis.
We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth.
arXiv Detail & Related papers (2022-11-26T07:41:48Z)
- Critical Initialization of Wide and Deep Neural Networks through Partial
Jacobians: General Theory and Applications [6.579523168465526]
We introduce partial Jacobians of a network, defined as derivatives of preactivations in layer $l$ with respect to preactivations in layer $l_0 \leq l$ (a minimal numerical sketch of this quantity appears after this list).
We derive recurrence relations for the norms of partial Jacobians and utilize these relations to analyze criticality of deep fully connected neural networks with LayerNorm and/or residual connections.
arXiv Detail & Related papers (2021-11-23T20:31:42Z)
- Finite Versus Infinite Neural Networks: an Empirical Study [69.07049353209463]
Kernel methods outperform fully-connected finite-width networks.
Centered and ensembled finite networks have reduced posterior variance.
Weight decay and the use of a large learning rate break the correspondence between finite and infinite networks.
arXiv Detail & Related papers (2020-07-31T01:57:47Z)
- The Surprising Simplicity of the Early-Time Learning Dynamics of Neural
Networks [43.860358308049044]
In this work, we show that these common perceptions can be completely false in the early phase of learning.
We argue that this surprising simplicity can persist in networks with more layers and with convolutional architectures.
arXiv Detail & Related papers (2020-06-25T17:42:49Z)
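The partial Jacobians defined in the Critical Initialization entry above can be computed directly. The following is a minimal NumPy sketch under assumed, illustrative choices (a plain tanh fully connected network with N(0, 1/width) weights and zero biases, not necessarily that paper's setup): it builds dh^l / dh^{l_0} by the chain rule and reports its Frobenius norm; all function names here are hypothetical.

```python
import numpy as np

def forward_preactivations(x, weights, biases, phi=np.tanh):
    """Preactivations h^1, ..., h^L of a fully connected net:
    h^1 = W^1 x + b^1,   h^{k+1} = W^{k+1} phi(h^k) + b^{k+1}."""
    h = [weights[0] @ x + biases[0]]
    for W, b in zip(weights[1:], biases[1:]):
        h.append(W @ phi(h[-1]) + b)
    return h

def partial_jacobian_norm(h, weights, l0, l, phi_prime=lambda z: 1.0 - np.tanh(z) ** 2):
    """Frobenius norm of the partial Jacobian d h^l / d h^{l0} (l0 <= l), built by the
    chain rule: J = W^l diag(phi'(h^{l-1})) ... W^{l0+1} diag(phi'(h^{l0}))."""
    J = np.eye(len(h[l0 - 1]))                      # layers are 1-indexed, as in the summary
    for k in range(l0, l):
        J = weights[k] @ (phi_prime(h[k - 1])[:, None] * J)
    return float(np.linalg.norm(J))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    width, depth = 256, 10
    # Illustrative initialization only: W ~ N(0, 1/width), zero biases.
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]
    biases = [np.zeros(width) for _ in range(depth)]
    h = forward_preactivations(rng.normal(size=width), weights, biases)
    print("||dh^10 / dh^2||_F =", partial_jacobian_norm(h, weights, l0=2, l=10))
```

The recurrence relations studied in that paper describe how such norms behave with depth; the direct product above is only a small-scale sanity check.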