Improving the Trainability of Deep Neural Networks through Layerwise
Batch-Entropy Regularization
- URL: http://arxiv.org/abs/2208.01134v1
- Date: Mon, 1 Aug 2022 20:31:58 GMT
- Title: Improving the Trainability of Deep Neural Networks through Layerwise
Batch-Entropy Regularization
- Authors: David Peer, Bart Keulen, Sebastian Stabinger, Justus Piater, Antonio
Rodríguez-Sánchez
- Abstract summary: We introduce and evaluate the batch-entropy, which quantifies the flow of information through each layer of a neural network.
We show that we can train a "vanilla" fully connected network and convolutional neural network with 500 layers by simply adding the batch-entropy regularization term to the loss function.
- Score: 1.3999481573773072
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training deep neural networks is a demanding task; it is especially
challenging to adapt architectures to improve the performance of the trained
models. Shallow networks sometimes generalize better than deep networks, and
adding more layers can lead to higher training and test errors. The deep
residual learning framework addresses this degradation
problem by adding skip connections to several neural network layers. It would
at first seem counter-intuitive that such skip connections are needed to train
deep networks successfully as the expressivity of a network would grow
exponentially with depth. In this paper, we first analyze the flow of
information through neural networks. We introduce and evaluate the
batch-entropy which quantifies the flow of information through each layer of a
neural network. We prove empirically and theoretically that a positive
batch-entropy is required for gradient descent-based training approaches to
optimize a given loss function successfully. Based on those insights, we
introduce batch-entropy regularization to enable gradient descent-based
training algorithms to optimize the flow of information through each hidden
layer individually. With batch-entropy regularization, gradient descent
optimizers can transform untrainable networks into trainable networks. We show
empirically that we can therefore train a "vanilla" fully connected network and
convolutional neural network -- no skip connections, batch normalization,
dropout, or any other architectural tweak -- with 500 layers by simply adding
the batch-entropy regularization term to the loss function. The effect of
batch-entropy regularization is not only evaluated on vanilla neural networks,
but also on residual networks, autoencoders, and transformer models across a
wide range of computer vision and natural language processing tasks.
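To make the idea concrete, the following is a minimal sketch of what layerwise batch-entropy regularization could look like, assuming the batch entropy of a layer is estimated as the mean per-neuron Gaussian entropy 0.5*log(2*pi*e*var) over the batch, and that a simple squared penalty pulls each hidden layer's batch entropy toward a fixed target. The names `batch_entropy`, `LBEBlock`, and `lbe_loss`, the target value, and the weight `beta` are illustrative choices, not the paper's exact formulation, which may differ in detail.

```python
# Minimal sketch of layerwise batch-entropy (LBE) regularization, assuming a
# Gaussian estimate of each neuron's entropy over the batch and a squared
# penalty toward a fixed target entropy. Names, the target value, and the
# weight `beta` are illustrative; the paper's exact formulation may differ.
import math
import torch
import torch.nn as nn


def batch_entropy(a: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Gaussian entropy estimate of a layer's activations over a batch.

    a: (batch, features) activations of one layer. Returns the per-neuron
    differential entropy 0.5 * log(2 * pi * e * var), averaged over neurons.
    """
    var = a.flatten(1).var(dim=0)  # empirical variance per neuron
    return (0.5 * torch.log(2 * math.pi * math.e * (var + eps))).mean()


class LBEBlock(nn.Module):
    """Fully connected layer that records its batch entropy on each forward pass."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.last_h = None  # batch entropy of the most recent batch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.relu(self.fc(x))
        self.last_h = batch_entropy(a)
        return a


def lbe_loss(blocks, target: float = 1.0) -> torch.Tensor:
    """Squared penalty pushing every hidden layer's batch entropy toward `target`."""
    return sum((b.last_h - target) ** 2 for b in blocks) / len(blocks)


# Usage: a deep "vanilla" MLP trained on task loss + beta * LBE regularizer.
blocks = nn.ModuleList([LBEBlock(64, 64) for _ in range(50)])
head = nn.Linear(64, 10)
params = list(blocks.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)
beta = 0.5  # regularization strength (hypothetical value)

x, y = torch.randn(128, 64), torch.randint(0, 10, (128,))
h = x
for block in blocks:
    h = block(h)
loss = nn.functional.cross_entropy(head(h), y) + beta * lbe_loss(blocks)
loss.backward()
opt.step()
```

With a per-layer term like this in the loss, gradient descent receives a direct signal whenever a layer's activations collapse (near-zero variance drives the Gaussian entropy estimate strongly negative), which reflects the paper's intuition for turning otherwise untrainable deep vanilla networks into trainable ones.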
Related papers
- Sensitivity-Based Layer Insertion for Residual and Feedforward Neural
Networks [0.3831327965422187]
Training of neural networks requires tedious and often manual tuning of the network architecture.
We propose a systematic method to insert new layers during the training process, which eliminates the need to choose a fixed network size before training.
arXiv Detail & Related papers (2023-11-27T16:44:13Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Why Lottery Ticket Wins? A Theoretical Perspective of Sample Complexity on Pruned Neural Networks [79.74580058178594]
We analyze the performance of training a pruned neural network by analyzing the geometric structure of the objective function.
We show that the convex region near a desirable model with guaranteed generalization enlarges as the neural network model is pruned.
arXiv Detail & Related papers (2021-10-12T01:11:07Z)
- Predify: Augmenting deep neural networks with brain-inspired predictive coding dynamics [0.5284812806199193]
We take inspiration from a popular framework in neuroscience: 'predictive coding'.
We show that implementing this strategy into two popular networks, VGG16 and EfficientNetB0, improves their robustness against various corruptions.
arXiv Detail & Related papers (2021-06-04T22:48:13Z)
- Local Critic Training for Model-Parallel Learning of Deep Neural Networks [94.69202357137452]
We propose a novel model-parallel learning method, called local critic training.
We show that the proposed approach successfully decouples the update process of the layer groups for both convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
We also show that trained networks by the proposed method can be used for structural optimization.
arXiv Detail & Related papers (2021-02-03T09:30:45Z)
- Bayesian Nested Neural Networks for Uncertainty Calibration and Adaptive Compression [40.35734017517066]
Nested networks or slimmable networks are neural networks whose architectures can be adjusted instantly during testing time.
Recent studies have focused on a "nested dropout" layer, which is able to order the nodes of a layer by importance during training.
arXiv Detail & Related papers (2021-01-27T12:34:58Z)
- Implicit recurrent networks: A novel approach to stationary input processing with recurrent neural networks in deep learning [0.0]
In this work, we introduce and test a novel implementation of recurrent neural networks into deep learning.
We provide an algorithm which implements backpropagation on an implicit implementation of recurrent networks.
A single-layer implicit recurrent network is able to solve the XOR problem, while a feed-forward network with monotonically increasing activation function fails at this task.
arXiv Detail & Related papers (2020-10-20T18:55:32Z)
- Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation [60.80172153614544]
Un-trained convolutional neural networks have emerged as highly successful tools for image recovery and restoration.
We show that an un-trained convolutional neural network can approximately reconstruct signals and images that are sufficiently structured, from a near minimal number of random measurements.
arXiv Detail & Related papers (2020-05-07T15:57:25Z)
- Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
- A Deep Conditioning Treatment of Neural Networks [37.192369308257504]
We show that depth improves trainability of neural networks by improving the conditioning of certain kernel matrices of the input data.
We provide versions of the result that hold for training just the top layer of the neural network, as well as for training all layers via the neural tangent kernel.
arXiv Detail & Related papers (2020-02-04T20:21:36Z)
- Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks [95.51368472949308]
Adaptation can be useful in cases when training data is scarce, or when one wishes to encode priors in the network.
In this paper, we propose a straightforward alternative: side-tuning.
arXiv Detail & Related papers (2019-12-31T18:52:32Z)