Deep Gated Networks: A framework to understand training and
generalisation in deep learning
- URL: http://arxiv.org/abs/2002.03996v2
- Date: Mon, 2 Mar 2020 17:25:46 GMT
- Title: Deep Gated Networks: A framework to understand training and
generalisation in deep learning
- Authors: Chandrashekar Lakshminarayanan and Amit Vikram Singh
- Abstract summary: We make use of deep gated networks (DGNs) as a framework to obtain insights about DNNs with ReLU activation.
Our theory sheds light on two questions, namely why increasing depth up to a point helps training and why increasing depth beyond that point hurts it.
- Score: 3.6954802719347426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the role of (stochastic) gradient descent (SGD) in the training
and generalisation of deep neural networks (DNNs) with ReLU activation has been
an object of study in the recent past. In this paper, we make use of deep gated
networks (DGNs) as a framework to obtain insights about DNNs with ReLU
activation. In DGNs, a single neuronal unit has two components, namely the
pre-activation input (equal to the inner product of the layer's weights and the
previous layer's outputs) and a gating value in $[0,1]$; the output of the
neuronal unit is the product of the pre-activation input and the gating value.
The standard DNN with ReLU activation is a special case of a DGN, wherein the
gating value is $1$ or $0$ depending on whether the pre-activation input is
positive or not. We theoretically analyse and experiment with several variants
of DGNs, each variant suited to understanding a particular aspect of either
training or generalisation in DNNs with ReLU activation. Our theory sheds light
on two questions, namely i) why increasing depth up to a point helps training
and ii) why increasing depth beyond that point hurts training. We also present
experimental evidence to show that gate adaptation, i.e., the change of gating
values over the course of training, is key for generalisation.
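As an illustration of the neuronal unit described above, here is a minimal NumPy sketch (not the authors' code; the layer sizes and the externally supplied gate vector are illustrative assumptions). It implements one gated layer, output = gating value × pre-activation input, and checks that plugging in 1/0 gates derived from the sign of the pre-activation recovers a standard ReLU layer.

```python
import numpy as np

def dgn_layer(x, W, gates):
    """One DGN layer: output = gating value * pre-activation input."""
    q = W @ x            # pre-activation: inner product of the layer weights and previous layer outputs
    return gates * q     # elementwise product with gating values in [0, 1]

def relu_layer(x, W):
    """Standard ReLU layer, recovered as the special case gate = 1 if pre-activation > 0, else 0."""
    q = W @ x
    gates = (q > 0).astype(q.dtype)   # 1/0 gating from the sign of the pre-activation
    return dgn_layer(x, W, gates)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)            # previous layer output (illustrative size)
W = rng.standard_normal((3, 4))       # layer weights (illustrative size)
assert np.allclose(relu_layer(x, W), np.maximum(W @ x, 0.0))
```

Decoupling the gates from the sign of the pre-activation in this way is what lets the different DGN variants isolate particular aspects of training and of gate adaptation.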
Related papers
- Deep Networks Always Grok and Here is Why [15.327649172531606]
Grokking, or delayed generalization, is a phenomenon where generalization in a deep neural network (DNN) occurs long after achieving near zero training error.
We demonstrate that grokking is actually much more widespread and materializes in a wide range of practical settings.
arXiv Detail & Related papers (2024-02-23T18:59:31Z)
- Label Deconvolution for Node Representation Learning on Large-scale Attributed Graphs against Learning Bias [75.44877675117749]
We propose an efficient label regularization technique, namely Label Deconvolution (LD), to alleviate the learning bias by a novel and highly scalable approximation to the inverse mapping of GNNs.
Experiments demonstrate that LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.
arXiv Detail & Related papers (2023-09-26T13:09:43Z)
- Graph Neural Networks Provably Benefit from Structural Information: A Feature Learning Perspective [53.999128831324576]
Graph neural networks (GNNs) have pioneered advancements in graph representation learning.
This study investigates the role of graph convolution within the context of feature learning theory.
arXiv Detail & Related papers (2023-06-24T10:21:11Z)
- Benign Overfitting in Deep Neural Networks under Lazy Training [72.28294823115502]
We show that when the data distribution is well-separated, DNNs can achieve Bayes-optimal test error for classification.
Our results indicate that interpolating with smoother functions leads to better generalization.
arXiv Detail & Related papers (2023-05-30T19:37:44Z)
- On Feature Learning in Neural Networks with Global Convergence Guarantees [49.870593940818715]
We study the optimization of wide neural networks (NNs) via gradient flow (GF).
We show that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF.
We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.
arXiv Detail & Related papers (2022-04-22T15:56:43Z)
- Explicitising The Implicit Intrepretability of Deep Neural Networks Via Duality [5.672223170618133]
Recent work by Lakshminarayanan and Singh provided a dual view for fully connected deep neural networks (DNNs) with rectified linear units (ReLU).
arXiv Detail & Related papers (2022-03-01T03:08:21Z)
- Disentangling deep neural networks with rectified linear units using duality [4.683806391173103]
We propose a novel interpretable counterpart of deep neural networks (DNNs) with rectified linear units (ReLUs).
We show that convolution with global pooling and skip connection provide, respectively, rotational invariance and an ensemble structure to the neural path kernel (NPK).
arXiv Detail & Related papers (2021-10-06T16:51:59Z)
- Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory? [2.0711789781518752]
Neural Tangent Kernel (NTK) theory is widely used to study the dynamics of infinitely-wide deep neural networks (DNNs) under gradient descent.
We study empirically when NTK theory is valid in practice for fully-connected ReLU and sigmoid DNNs.
In particular, NTK theory does not explain the behavior of sufficiently deep networks whose gradients explode as they propagate through the network's layers.
arXiv Detail & Related papers (2020-12-08T15:19:45Z)
- Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks [60.22494363676747]
It is known that current graph neural networks (GNNs) are difficult to make deep due to the problem known as over-smoothing.
Multi-scale GNNs are a promising approach for mitigating the over-smoothing problem.
We derive the optimization and generalization guarantees of transductive learning algorithms that include multi-scale GNNs.
arXiv Detail & Related papers (2020-06-15T17:06:17Z)
- Neural Path Features and Neural Path Kernel : Understanding the role of gates in deep learning [3.6954802719347426]
This paper analytically characterises the role of active sub-networks in deep learning.
We encode the on/off state of the gates for a given input in a novel 'neural path feature' (NPF).
We show that the output of the network is indeed the inner product of the NPF and the neural path value (NPV); a toy verification appears after this list.
arXiv Detail & Related papers (2020-06-11T19:36:40Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
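As a toy check of the NPF/NPV decomposition referenced in the 'Neural Path Features and Neural Path Kernel' entry above, the following NumPy sketch (my illustration, not code from that paper; the one-hidden-layer, bias-free architecture and sizes are assumptions) enumerates the input-to-output paths of a small ReLU network and verifies that its output equals the inner product of the neural path features and neural path values.

```python
import numpy as np

# Tiny bias-free ReLU network (illustrative sizes): 2 inputs -> 3 hidden ReLU units -> 1 output.
rng = np.random.default_rng(1)
x = rng.standard_normal(2)
W1 = rng.standard_normal((3, 2))   # input-to-hidden weights
w2 = rng.standard_normal(3)        # hidden-to-output weights

q = W1 @ x                         # hidden pre-activations
g = (q > 0).astype(float)          # on/off gates for this input
y = w2 @ (g * q)                   # network output (ReLU written as gate * pre-activation)

# Each input->hidden->output path p = (i, h) contributes
#   NPF_p = x[i] * g[h]        (input coordinate times the product of gates on the path)
#   NPV_p = W1[h, i] * w2[h]   (product of weights on the path)
npf = np.array([x[i] * g[h] for h in range(3) for i in range(2)])
npv = np.array([W1[h, i] * w2[h] for h in range(3) for i in range(2)])

assert np.isclose(y, npf @ npv)    # output = inner product of NPF and NPV
```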