Improving the Trainability of Deep Neural Networks through Layerwise
Batch-Entropy Regularization
- URL: http://arxiv.org/abs/2208.01134v1
- Date: Mon, 1 Aug 2022 20:31:58 GMT
- Title: Improving the Trainability of Deep Neural Networks through Layerwise
Batch-Entropy Regularization
- Authors: David Peer, Bart Keulen, Sebastian Stabinger, Justus Piater, Antonio
Rodríguez-Sánchez
- Abstract summary: We introduce and evaluate the batch-entropy, which quantifies the flow of information through each layer of a neural network.
We show that we can train a "vanilla" fully connected network and convolutional neural network with 500 layers by simply adding the batch-entropy regularization term to the loss function.
- Score: 1.3999481573773072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training deep neural networks is a demanding task; it is especially
challenging to adapt architectures to improve the performance of the trained
models. Shallow networks sometimes generalize better than deep networks, and
adding more layers can lead to higher training and test errors. The deep
residual learning framework addresses this degradation
problem by adding skip connections to several neural network layers. It would
at first seem counter-intuitive that such skip connections are needed to train
deep networks successfully as the expressivity of a network would grow
exponentially with depth. In this paper, we first analyze the flow of
information through neural networks. We introduce and evaluate the
batch-entropy which quantifies the flow of information through each layer of a
neural network. We prove empirically and theoretically that a positive
batch-entropy is required for gradient descent-based training approaches to
optimize a given loss function successfully. Based on those insights, we
introduce batch-entropy regularization to enable gradient descent-based
training algorithms to optimize the flow of information through each hidden
layer individually. With batch-entropy regularization, gradient descent
optimizers can transform untrainable networks into trainable networks. We show
empirically that we can therefore train a "vanilla" fully connected network and
convolutional neural network -- no skip connections, batch normalization,
dropout, or any other architectural tweak -- with 500 layers by simply adding
the batch-entropy regularization term to the loss function. The effect of
batch-entropy regularization is not only evaluated on vanilla neural networks,
but also on residual networks, autoencoders, and transformer models across a
wide range of computer vision and natural language processing tasks.
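To make the idea concrete, the following is a minimal sketch of what layerwise batch-entropy regularization could look like, assuming the batch entropy of a layer is estimated as the mean per-neuron Gaussian entropy 0.5*log(2*pi*e*var) over the batch, and that a simple squared penalty pulls each hidden layer's batch entropy toward a fixed target. The names `batch_entropy`, `LBEBlock`, and `lbe_loss`, the target value, and the weight `beta` are illustrative choices, not the paper's exact formulation, which may differ in detail.

```python
# Minimal sketch of layerwise batch-entropy (LBE) regularization, assuming a
# Gaussian estimate of each neuron's entropy over the batch and a squared
# penalty toward a fixed target entropy. Names, the target value, and the
# weight `beta` are illustrative; the paper's exact formulation may differ.
import math
import torch
import torch.nn as nn


def batch_entropy(a: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Gaussian entropy estimate of a layer's activations over a batch.

    a: (batch, features) activations of one layer. Returns the per-neuron
    differential entropy 0.5 * log(2 * pi * e * var), averaged over neurons.
    """
    var = a.flatten(1).var(dim=0)  # empirical variance per neuron
    return (0.5 * torch.log(2 * math.pi * math.e * (var + eps))).mean()


class LBEBlock(nn.Module):
    """Fully connected layer that records its batch entropy on each forward pass."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.last_h = None  # batch entropy of the most recent batch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.relu(self.fc(x))
        self.last_h = batch_entropy(a)
        return a


def lbe_loss(blocks, target: float = 1.0) -> torch.Tensor:
    """Squared penalty pushing every hidden layer's batch entropy toward `target`."""
    return sum((b.last_h - target) ** 2 for b in blocks) / len(blocks)


# Usage: a deep "vanilla" MLP trained on task loss + beta * LBE regularizer.
blocks = nn.ModuleList([LBEBlock(64, 64) for _ in range(50)])
head = nn.Linear(64, 10)
params = list(blocks.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)
beta = 0.5  # regularization strength (hypothetical value)

x, y = torch.randn(128, 64), torch.randint(0, 10, (128,))
h = x
for block in blocks:
    h = block(h)
loss = nn.functional.cross_entropy(head(h), y) + beta * lbe_loss(blocks)
loss.backward()
opt.step()
```

With a per-layer term like this in the loss, gradient descent receives a direct signal whenever a layer's activations collapse (near-zero variance drives the Gaussian entropy estimate strongly negative), which reflects the paper's intuition for turning otherwise untrainable deep vanilla networks into trainable ones.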
Related papers
- Sensitivity-Based Layer Insertion for Residual and Feedforward Neural
Networks [0.3831327965422187]
Training of neural networks requires tedious and often manual tuning of the network architecture.
We propose a systematic method to insert new layers during the training process, which eliminates the need to choose a fixed network size before training.
arXiv Detail & Related papers (2023-11-27T16:44:13Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- Predify: Augmenting deep neural networks with brain-inspired predictive coding dynamics [0.5284812806199193]
We take inspiration from a popular framework in neuroscience: 'predictive coding'.
We show that implementing this strategy into two popular networks, VGG16 and EfficientNetB0, improves their robustness against various corruptions.
arXiv Detail & Related papers (2021-06-04T22:48:13Z)
- Local Critic Training for Model-Parallel Learning of Deep Neural Networks [94.69202357137452]
We propose a novel model-parallel learning method, called local critic training.
We show that the proposed approach successfully decouples the update process of the layer groups for both convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
We also show that trained networks by the proposed method can be used for structural optimization.
arXiv Detail & Related papers (2021-02-03T09:30:45Z)
- Bayesian Nested Neural Networks for Uncertainty Calibration and Adaptive Compression [40.35734017517066]
Nested networks or slimmable networks are neural networks whose architectures can be adjusted instantly during testing time.
Recent studies have focused on a "nested dropout" layer, which is able to order the nodes of a layer by importance during training.
arXiv Detail & Related papers (2021-01-27T12:34:58Z)
- Implicit recurrent networks: A novel approach to stationary input processing with recurrent neural networks in deep learning [0.0]
In this work, we introduce and test a novel implementation of recurrent neural networks into deep learning.
We provide an algorithm which implements backpropagation on an implicit implementation of recurrent networks.
A single-layer implicit recurrent network is able to solve the XOR problem, while a feed-forward network with monotonically increasing activation function fails at this task.
arXiv Detail & Related papers (2020-10-20T18:55:32Z)
- Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation [60.80172153614544]
Un-trained convolutional neural networks have emerged as highly successful tools for image recovery and restoration.
We show that an un-trained convolutional neural network can approximately reconstruct signals and images that are sufficiently structured, from a near minimal number of random measurements.
arXiv Detail & Related papers (2020-05-07T15:57:25Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
- A Deep Conditioning Treatment of Neural Networks [37.192369308257504]
We show that depth improves trainability of neural networks by improving the conditioning of certain kernel matrices of the input data.
We provide versions of the result that hold for training just the top layer of the neural network, as well as for training all layers via the neural tangent kernel.
arXiv Detail & Related papers (2020-02-04T20:21:36Z)
- Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks [95.51368472949308]
Adaptation can be useful in cases when training data is scarce, or when one wishes to encode priors in the network.
In this paper, we propose a straightforward alternative: side-tuning.
arXiv Detail & Related papers (2019-12-31T18:52:32Z)