Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
- URL: http://arxiv.org/abs/2007.00811v2
- Date: Mon, 17 Aug 2020 17:43:30 GMT
- Title: Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
- Authors: Denny Zhou, Mao Ye, Chen Chen, Tianjian Meng, Mingxing Tan, Xiaodan
Song, Quoc Le, Qiang Liu, and Dale Schuurmans
- Abstract summary: We propose an efficient method to train a deep thin network with a theoretical guarantee.
By training with our method, ResNet50 can outperform ResNet101, and BERT Base can be comparable with BERT Large.
- Score: 62.26044348366186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For deployment into production, a deep learning model needs to be both accurate and compact to meet latency and memory constraints. This usually
results in a network that is deep (to ensure performance) and yet thin (to
improve computational efficiency). In this paper, we propose an efficient
method to train a deep thin network with a theoretical guarantee. Our method is
motivated by model compression. It consists of three stages. First, we
sufficiently widen the deep thin network and train it until convergence. Then,
we use this well-trained deep wide network to warm up (or initialize) the
original deep thin network. This is achieved by layerwise imitation, that is,
forcing the thin network to mimic the intermediate outputs of the wide network
from layer to layer. Finally, we further fine-tune this already well-initialized deep thin network. The theoretical guarantee is established using mean field analysis of neural networks, which demonstrates the advantage of our layerwise imitation approach over training the thin network directly with backpropagation. We also conduct large-scale
empirical experiments to validate the proposed method. By training with our
method, ResNet50 can outperform ResNet101, and BERT Base can be comparable with
BERT Large, when ResNet101 and BERT Large are trained under the standard
training procedures as in the literature.
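The layerwise-imitation warm-up is the core of the recipe, so a concrete sketch may help. The code below is our own minimal illustration of stage two, assuming PyTorch, plain MLP blocks whose input/output width is shared by the wide and thin networks (so intermediate outputs are directly comparable without extra projections), and a simple per-layer MSE imitation loss; it is not the authors' implementation.

```python
# Hypothetical sketch of stage 2 (layerwise imitation); not the paper's code.
import torch
import torch.nn as nn

def make_blocks(depth, width, hidden):
    # Each block maps width -> hidden -> width, so a wide net (large `hidden`)
    # and a thin net (small `hidden`) produce per-layer outputs of equal size.
    return nn.ModuleList([
        nn.Sequential(nn.Linear(width, hidden), nn.ReLU(), nn.Linear(hidden, width))
        for _ in range(depth)
    ])

def layerwise_imitation(wide_blocks, thin_blocks, loader, steps=1000, lr=1e-3):
    """Warm up the thin blocks by matching the wide blocks' intermediate outputs."""
    opt = torch.optim.Adam(thin_blocks.parameters(), lr=lr)
    mse = nn.MSELoss()
    for step, (x, _) in enumerate(loader):
        if step >= steps:
            break
        h_wide, h_thin, loss = x, x, 0.0
        for wb, tb in zip(wide_blocks, thin_blocks):
            h_wide = wb(h_wide).detach()       # teacher activations, no gradient
            h_thin = tb(h_thin)
            loss = loss + mse(h_thin, h_wide)  # imitate layer by layer
        opt.zero_grad()
        loss.backward()
        opt.step()

# e.g. wide_blocks = make_blocks(depth=12, width=128, hidden=2048)  # stage 1 teacher
#      thin_blocks = make_blocks(depth=12, width=128, hidden=256)   # target thin net
# Stage 1 (train the wide network to convergence) and stage 3 (fine-tune the
# warmed-up thin network on the task loss) are ordinary supervised training
# and are omitted here.
```

Whether each thin block consumes its own previous activation (as above) or the teacher's, and how the per-layer losses are weighted, are choices made here for brevity; matching block output widths is likewise an assumption of this sketch rather than a claim about the paper's construction.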
Related papers
- Deep Fusion: Efficient Network Training via Pre-trained Initializations [3.9146761527401424]
We present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks.
Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements.
We validate our theoretical framework, which guides the optimal use of Deep Fusion, showing that it significantly reduces both training time and resource consumption.
arXiv Detail & Related papers (2023-06-20T21:30:54Z)
- Comparison between layer-to-layer network training and conventional network training using Deep Convolutional Neural Networks [0.6853165736531939]
Convolutional neural networks (CNNs) are widely used in various applications due to their effectiveness in extracting features from data.
We propose a layer-to-layer training method and compare its performance with the conventional training method.
Our experiments show that the layer-to-layer training method outperforms the conventional training method for both models.
arXiv Detail & Related papers (2023-03-27T14:29:18Z)
- Layer Folding: Neural Network Depth Reduction using Activation Linearization [0.0]
Modern devices exhibit a high level of parallelism, but real-time latency is still highly dependent on networks' depth.
We propose a method that learns whether non-linear activations can be removed, allowing consecutive linear layers to be folded into one (a minimal folding sketch follows this entry).
We apply our method to networks pre-trained on CIFAR-10 and CIFAR-100 and find that they can all be transformed into shallower forms that share a similar depth.
arXiv Detail & Related papers (2021-06-17T08:22:46Z)
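Once the activation between two linear layers has been removed (or made linear), the layers can be folded by composing their affine maps: W = W2 W1 and b = W2 b1 + b2. Below is a minimal sketch of that algebraic step, assuming PyTorch and fully connected layers with biases; how the removable activations are identified and learned is the paper's contribution and is not shown.

```python
# Hypothetical sketch: fold second(first(x)) into a single nn.Linear once no
# non-linearity remains between the two layers.
import torch
import torch.nn as nn

@torch.no_grad()
def fold_linear(first: nn.Linear, second: nn.Linear) -> nn.Linear:
    # Composition of affine maps: W = W2 @ W1, b = W2 @ b1 + b2.
    folded = nn.Linear(first.in_features, second.out_features)
    folded.weight.copy_(second.weight @ first.weight)
    folded.bias.copy_(second.weight @ first.bias + second.bias)
    return folded

# Sanity check: the folded layer matches the two-layer composition.
f, s = nn.Linear(8, 16), nn.Linear(16, 4)
x = torch.randn(2, 8)
assert torch.allclose(fold_linear(f, s)(x), s(f(x)), atol=1e-5)
```

Two linear convolutions compose into a single convolution with a larger kernel in the same way, which is how the idea carries over to CNNs.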
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model that performs better than training the low-rank model alone (a hedged sketch of one such core parameterization follows this entry).
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
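One plausible way to realize a predefined core that can be split off, suggested by the Transformer result above, is to parameterize each weight matrix as a rank-r factorization plus a dense residual and to train the full and core-only forward passes jointly. The class name, initialization, and joint loss below are our own illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch of a linear layer with a low-rank core that can be split off.
import torch
import torch.nn as nn

class LowRankCoreLinear(nn.Module):
    """Weight = U @ V (rank-r core) + R (dense residual, dropped at split-off)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) / rank ** 0.5)
        self.V = nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)
        self.R = nn.Parameter(torch.zeros(d_out, d_in))
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, x, core_only=False):
        W = self.U @ self.V
        if not core_only:
            W = W + self.R
        return x @ W.t() + self.b

# Simultaneous training step: one loss for the full layer and one for the core
# alone, so the core remains usable after the residual is discarded.
layer = LowRankCoreLinear(32, 10, rank=4)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = loss_fn(layer(x), y) + loss_fn(layer(x, core_only=True), y)
opt.zero_grad()
loss.backward()
opt.step()
```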
- BCNet: Searching for Network Width with Bilaterally Coupled Network [56.14248440683152]
We introduce a new supernet called the Bilaterally Coupled Network (BCNet) to address the unfair training of channels in supernet-based width search.
In BCNet, each channel is fairly trained and is responsible for the same number of network widths, so each network width can be evaluated more accurately.
Our method achieves state-of-the-art or competitive performance compared with other baseline methods.
arXiv Detail & Related papers (2021-05-21T18:54:03Z)
- Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, showing better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
- Network Pruning via Resource Reallocation [75.85066435085595]
We propose a simple yet effective channel pruning technique, termed network Pruning via rEsource rEalLocation (PEEL).
PEEL first constructs a predefined backbone and then conducts resource reallocation on it to shift parameters from less informative layers to more important layers in one round.
Experimental results show that structures uncovered by PEEL exhibit competitive performance with state-of-the-art pruning algorithms under various pruning settings.
arXiv Detail & Related papers (2021-03-02T16:28:10Z)
- Training Larger Networks for Deep Reinforcement Learning [18.193180866998333]
We show that naively increasing network capacity does not improve performance.
We propose a novel method that consists of 1) wider networks with DenseNet connections, 2) decoupling representation learning from RL training, and 3) a distributed training method to mitigate overfitting.
Using this three-fold technique, we show that we can train very large networks that result in significant performance gains.
arXiv Detail & Related papers (2021-02-16T02:16:54Z)
- Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks [78.47459801017959]
Sparsity can reduce the memory footprint of regular networks to fit on mobile devices.
We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice.
arXiv Detail & Related papers (2021-01-31T22:48:50Z)
- Picking Winning Tickets Before Training by Preserving Gradient Flow [9.67608102763644]
We argue that efficient training requires preserving the gradient flow through the network.
We investigate the effectiveness of the proposed method with extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet (a hedged sketch of one gradient-flow score follows this entry).
arXiv Detail & Related papers (2020-02-18T05:14:47Z)
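The "preserving gradient flow" criterion can be made concrete with a first-order argument: removing weight theta_q perturbs the parameters by -theta_q e_q, which changes the gradient norm ||g||^2 by roughly -2 theta_q (Hg)_q, so weights with the smallest theta * (Hg) values are the safest to remove. Below is a rough sketch of that score computed with a Hessian-vector product in PyTorch; the paper's exact scoring, normalization, and pruning schedule may differ.

```python
# Hypothetical sketch: score weights by their first-order effect on ||grad||^2.
import torch
import torch.nn as nn

def gradient_flow_scores(model, loss):
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Hessian-vector product H g, with v = g treated as a constant.
    dot = sum((g * g.detach()).sum() for g in grads)
    hg = torch.autograd.grad(dot, params)
    # Removing weight q changes ||grad||^2 by about -2 * theta_q * (H g)_q,
    # so smaller scores mean removal hurts gradient flow less.
    return [p.detach() * h for p, h in zip(params, hg)]

# Usage at initialization: prune the lowest-scoring weights first.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
x, y = torch.randn(16, 20), torch.randint(0, 5, (16,))
scores = gradient_flow_scores(model, nn.CrossEntropyLoss()(model(x), y))
```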
This list is automatically generated from the titles and abstracts of the papers on this site.