Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
- URL: http://arxiv.org/abs/2007.00811v2
- Date: Mon, 17 Aug 2020 17:43:30 GMT
- Title: Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
- Authors: Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan
Song, Quoc Le, Qiang Liu, and Dale Schuurmans
- Abstract summary: We propose an efficient method to train a deep thin network with a theoretical guarantee.
By training with our method, ResNet50 can outperform ResNet101, and BERT Base can be comparable with BERT Large.
- Score: 62.26044348366186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For deployment into production, a deep learning model needs to be both accurate and compact to meet latency and memory constraints. This usually
results in a network that is deep (to ensure performance) and yet thin (to
improve computational efficiency). In this paper, we propose an efficient
method to train a deep thin network with a theoretical guarantee. Our method is
motivated by model compression. It consists of three stages. First, we
sufficiently widen the deep thin network and train it until convergence. Then,
we use this well-trained deep wide network to warm up (or initialize) the
original deep thin network. This is achieved by layerwise imitation, that is,
forcing the thin network to mimic the intermediate outputs of the wide network
from layer to layer. Finally, we further fine-tune this already well-initialized deep thin network. The theoretical guarantee is established using mean field analysis of neural networks, which demonstrates the advantage of our layerwise imitation approach over training the thin network directly with backpropagation. We also conduct large-scale
empirical experiments to validate the proposed method. By training with our
method, ResNet50 can outperform ResNet101, and BERT Base can be comparable with
BERT Large, when ResNet101 and BERT Large are trained under the standard
training procedures as in the literature.
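The layerwise-imitation warm-up is the core of the recipe, so a concrete sketch may help. The code below is our own minimal illustration of stage two, assuming PyTorch, plain MLP blocks whose input/output width is shared by the wide and thin networks (so intermediate outputs are directly comparable without extra projections), and a simple per-layer MSE imitation loss; it is not the authors' implementation.

```python
# Hypothetical sketch of stage 2 (layerwise imitation); not the paper's code.
import torch
import torch.nn as nn

def make_blocks(depth, width, hidden):
    # Each block maps width -> hidden -> width, so a wide net (large `hidden`)
    # and a thin net (small `hidden`) produce per-layer outputs of equal size.
    return nn.ModuleList([
        nn.Sequential(nn.Linear(width, hidden), nn.ReLU(), nn.Linear(hidden, width))
        for _ in range(depth)
    ])

def layerwise_imitation(wide_blocks, thin_blocks, loader, steps=1000, lr=1e-3):
    """Warm up the thin blocks by matching the wide blocks' intermediate outputs."""
    opt = torch.optim.Adam(thin_blocks.parameters(), lr=lr)
    mse = nn.MSELoss()
    for step, (x, _) in enumerate(loader):
        if step >= steps:
            break
        h_wide, h_thin, loss = x, x, 0.0
        for wb, tb in zip(wide_blocks, thin_blocks):
            h_wide = wb(h_wide).detach()       # teacher activations, no gradient
            h_thin = tb(h_thin)
            loss = loss + mse(h_thin, h_wide)  # imitate layer by layer
        opt.zero_grad()
        loss.backward()
        opt.step()

# e.g. wide_blocks = make_blocks(depth=12, width=128, hidden=2048)  # stage 1 teacher
#      thin_blocks = make_blocks(depth=12, width=128, hidden=256)   # target thin net
# Stage 1 (train the wide network to convergence) and stage 3 (fine-tune the
# warmed-up thin network on the task loss) are ordinary supervised training
# and are omitted here.
```

Whether each thin block consumes its own previous activation (as above) or the teacher's, and how the per-layer losses are weighted, are choices made here for brevity; matching block output widths is likewise an assumption of this sketch rather than a claim about the paper's construction.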
Related papers
- Deep Fusion: Efficient Network Training via Pre-trained Initializations [3.9146761527401424]
We present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks.
Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements.
We validate our theoretical framework, which guides the optimal use of Deep Fusion, showing that it significantly reduces both training time and resource consumption.
arXiv Detail & Related papers (2023-06-20T21:30:54Z)
- Comparison between layer-to-layer network training and conventional network training using Deep Convolutional Neural Networks [0.6853165736531939]
Convolutional neural networks (CNNs) are widely used in various applications due to their effectiveness in extracting features from data.
We propose a layer-to-layer training method and compare its performance with the conventional training method.
Our experiments show that the layer-to-layer training method outperforms the conventional training method for both models.
arXiv Detail & Related papers (2023-03-27T14:29:18Z)
- Layer Folding: Neural Network Depth Reduction using Activation Linearization [0.0]
Modern devices exhibit a high level of parallelism, but real-time latency is still highly dependent on networks' depth.
We propose a method that learns whether non-linear activations can be removed, allowing consecutive linear layers to be folded into one (a minimal folding sketch follows this entry).
We apply our method to networks pre-trained on CIFAR-10 and CIFAR-100 and find that they can all be transformed into shallower forms that share a similar depth.
arXiv Detail & Related papers (2021-06-17T08:22:46Z)
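Once the activation between two linear layers has been removed (or made linear), the layers can be folded by composing their affine maps: W = W2 W1 and b = W2 b1 + b2. Below is a minimal sketch of that algebraic step, assuming PyTorch and fully connected layers with biases; how the removable activations are identified and learned is the paper's contribution and is not shown.

```python
# Hypothetical sketch: fold second(first(x)) into a single nn.Linear once no
# non-linearity remains between the two layers.
import torch
import torch.nn as nn

@torch.no_grad()
def fold_linear(first: nn.Linear, second: nn.Linear) -> nn.Linear:
    # Composition of affine maps: W = W2 @ W1, b = W2 @ b1 + b2.
    folded = nn.Linear(first.in_features, second.out_features)
    folded.weight.copy_(second.weight @ first.weight)
    folded.bias.copy_(second.weight @ first.bias + second.bias)
    return folded

# Sanity check: the folded layer matches the two-layer composition.
f, s = nn.Linear(8, 16), nn.Linear(16, 4)
x = torch.randn(2, 8)
assert torch.allclose(fold_linear(f, s)(x), s(f(x)), atol=1e-5)
```

Two linear convolutions compose into a single convolution with a larger kernel in the same way, which is how the idea carries over to CNNs.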
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that performs better than training the low-rank model alone (a hedged sketch of one such core parameterization follows this entry).
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
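One plausible way to realize a predefined core that can be split off, suggested by the Transformer result above, is to parameterize each weight matrix as a rank-r factorization plus a dense residual and to train the full and core-only forward passes jointly. The class name, initialization, and joint loss below are our own illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch of a linear layer with a low-rank core that can be split off.
import torch
import torch.nn as nn

class LowRankCoreLinear(nn.Module):
    """Weight = U @ V (rank-r core) + R (dense residual, dropped at split-off)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) / rank ** 0.5)
        self.V = nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)
        self.R = nn.Parameter(torch.zeros(d_out, d_in))
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, x, core_only=False):
        W = self.U @ self.V
        if not core_only:
            W = W + self.R
        return x @ W.t() + self.b

# Simultaneous training step: one loss for the full layer and one for the core
# alone, so the core remains usable after the residual is discarded.
layer = LowRankCoreLinear(32, 10, rank=4)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = loss_fn(layer(x), y) + loss_fn(layer(x, core_only=True), y)
opt.zero_grad()
loss.backward()
opt.step()
```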
- BCNet: Searching for Network Width with Bilaterally Coupled Network [56.14248440683152]
We introduce a new supernet called the Bilaterally Coupled Network (BCNet) to address the unfair training of channels in supernet-based width search.
In BCNet, each channel is fairly trained and is responsible for the same number of network widths, so each network width can be evaluated more accurately.
Our method achieves state-of-the-art or competitive performance compared with other baseline methods.
arXiv Detail & Related papers (2021-05-21T18:54:03Z)
- Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, showing better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
- Network Pruning via Resource Reallocation [75.85066435085595]
We propose a simple yet effective channel pruning technique, termed network Pruning via rEsource rEalLocation (PEEL).
PEEL first constructs a predefined backbone and then conducts resource reallocation on it to shift parameters from less informative layers to more important layers in one round.
Experimental results show that structures uncovered by PEEL exhibit competitive performance with state-of-the-art pruning algorithms under various pruning settings.
arXiv Detail & Related papers (2021-03-02T16:28:10Z)
- Training Larger Networks for Deep Reinforcement Learning [18.193180866998333]
We show that naively increasing network capacity does not improve performance.
We propose a novel method that consists of 1) wider networks with DenseNet connections, 2) decoupling representation learning from RL training, and 3) a distributed training method to mitigate overfitting.
Using this three-fold technique, we show that we can train very large networks that result in significant performance gains.
arXiv Detail & Related papers (2021-02-16T02:16:54Z)
- Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks [78.47459801017959]
Sparsity can reduce the memory footprint of regular networks to fit on mobile devices.
We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice.
arXiv Detail & Related papers (2021-01-31T22:48:50Z)
- Picking Winning Tickets Before Training by Preserving Gradient Flow [9.67608102763644]
We argue that efficient training requires preserving the gradient flow through the network.
We investigate the effectiveness of the proposed method with extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet (a hedged sketch of one gradient-flow score follows this entry).
arXiv Detail & Related papers (2020-02-18T05:14:47Z)
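The "preserving gradient flow" criterion can be made concrete with a first-order argument: removing weight theta_q perturbs the parameters by -theta_q e_q, which changes the gradient norm ||g||^2 by roughly -2 theta_q (Hg)_q, so weights with the smallest theta * (Hg) values are the safest to remove. Below is a rough sketch of that score computed with a Hessian-vector product in PyTorch; the paper's exact scoring, normalization, and pruning schedule may differ.

```python
# Hypothetical sketch: score weights by their first-order effect on ||grad||^2.
import torch
import torch.nn as nn

def gradient_flow_scores(model, loss):
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Hessian-vector product H g, with v = g treated as a constant.
    dot = sum((g * g.detach()).sum() for g in grads)
    hg = torch.autograd.grad(dot, params)
    # Removing weight q changes ||grad||^2 by about -2 * theta_q * (H g)_q,
    # so smaller scores mean removal hurts gradient flow less.
    return [p.detach() * h for p, h in zip(params, hg)]

# Usage at initialization: prune the lowest-scoring weights first.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
x, y = torch.randn(16, 20), torch.randint(0, 5, (16,))
scores = gradient_flow_scores(model, nn.CrossEntropyLoss()(model(x), y))
```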
This list is automatically generated from the titles and abstracts of the papers on this site.