Embedded Knowledge Distillation in Depth-level Dynamic Neural Network
- URL: http://arxiv.org/abs/2103.00793v1
- Date: Mon, 1 Mar 2021 06:35:31 GMT
- Title: Embedded Knowledge Distillation in Depth-level Dynamic Neural Network
- Authors: Shuchang Lyu, Ting-Bing Xu and Guangliang Cheng
- Abstract summary: We propose an elegant Depth-level Dynamic Neural Network (DDNN) that integrates different-depth sub-nets of similar architecture.
In this article, we design an Embedded-Knowledge-Distillation (EKD) training mechanism for the DDNN to transfer semantic knowledge from the teacher (full) net to multiple sub-nets.
Experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that sub-nets in a DDNN trained with EKD achieve better performance than depth-level pruning or individual training.
- Score: 8.207403859762044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In real applications, devices with different computation resources need networks of different depths (e.g., ResNet-18/34/50) with high accuracy. Usually, existing strategies either design multiple networks (nets) and train them independently, or use compression techniques (e.g., low-rank decomposition, pruning, and teacher-to-student distillation) to shrink a trained large model into a small net. These methods suffer either from the low accuracy of the small nets or from complicated training processes induced by dependence on accompanying assistive large models. In this article, we propose an elegant Depth-level Dynamic Neural Network (DDNN) that integrates different-depth sub-nets of similar architecture. Instead of training individual nets with different depth configurations, we train a single DDNN that dynamically switches between different-depth sub-nets at runtime using one set of shared weight parameters. To improve the generalization of the sub-nets, we design the Embedded-Knowledge-Distillation (EKD) training mechanism for the DDNN, which transfers semantic knowledge from the teacher (full) net to the multiple sub-nets. Specifically, the Kullback-Leibler divergence is introduced to constrain the consistency of the posterior class probabilities between the full net and each sub-net, and self-attention on same-resolution features at different depths is applied to drive richer feature representations in the sub-nets. Thus, we obtain multiple high-accuracy sub-nets simultaneously in a DDNN via online knowledge distillation in each training iteration, without extra computation cost. Extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that sub-nets in a DDNN trained with EKD achieve better performance than depth-level pruning or individual training, while preserving the original performance of the full net.
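To make the training mechanism concrete, below is a minimal PyTorch sketch (not the authors' code) of the core EKD idea from the abstract: a depth-switchable network whose shallower sub-nets share weights with the full net, trained with cross-entropy plus a KL-divergence term that pulls each sub-net's softened posterior toward the full net's. The names (DepthDynamicNet, ekd_step), the toy architecture, the sub-net depths, and the hyperparameters (temperature, alpha) are assumptions for illustration; the self-attention feature-transfer term described in the abstract is omitted for brevity.

```python
# Illustrative sketch of depth-level dynamic training with embedded
# knowledge distillation; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthDynamicNet(nn.Module):
    """Toy DDNN: any prefix of the stages forms a shallower sub-net
    that shares weights with the full-depth net."""
    def __init__(self, num_classes=10, width=64, num_stages=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(width, width, 3, padding=1, bias=False),
                nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for _ in range(num_stages)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(width, num_classes))

    def forward(self, x, depth=None):
        # `depth` selects how many stages to run; None runs the full net.
        depth = len(self.stages) if depth is None else depth
        h = self.stem(x)
        for stage in self.stages[:depth]:
            h = stage(h)
        return self.head(h)

def ekd_step(model, x, y, sub_depths=(1, 2, 3), temperature=4.0, alpha=0.5):
    """One training step: label loss for the full net, plus label loss and
    a KL distillation term for each weight-sharing sub-net."""
    full_logits = model(x)                      # teacher = full-depth net
    loss = F.cross_entropy(full_logits, y)
    soft_teacher = F.softmax(full_logits.detach() / temperature, dim=1)
    for d in sub_depths:
        sub_logits = model(x, depth=d)          # student = shallower sub-net
        kl = F.kl_div(F.log_softmax(sub_logits / temperature, dim=1),
                      soft_teacher, reduction="batchmean") * temperature ** 2
        loss = loss + F.cross_entropy(sub_logits, y) + alpha * kl
    return loss

# Usage sketch:
# model = DepthDynamicNet()
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# loss = ekd_step(model, images, labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Because the teacher and the sub-nets share one set of weights, the distillation here happens online within the same iteration, reflecting the abstract's claim that no separate assistive large model is needed.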
Related papers
- Cooperative Learning for Cost-Adaptive Inference [3.301728339780329]
The proposed framework is not tied to any specific architecture but can incorporate any existing models/architectures.
It provides accuracy comparable to the full network while making models of various sizes available.
arXiv Detail & Related papers (2023-12-13T21:42:27Z) - Automated Heterogeneous Low-Bit Quantization of Multi-Model Deep Learning Inference Pipeline [2.9342849999747624]
Multiple Deep Neural Networks (DNNs) integrated into single Deep Learning (DL) inference pipelines pose challenges for edge deployment.
This paper introduces an automated heterogeneous quantization approach for DL inference pipelines with multiple DNNs.
arXiv Detail & Related papers (2023-11-10T05:02:20Z) - Efficient Implementation of a Multi-Layer Gradient-Free Online-Trainable Spiking Neural Network on FPGA [0.31498833540989407]
ODESA is the first network to have end-to-end multi-layer online local supervised training without using gradients.
This research shows that the network architecture and the online training of weights and thresholds can be implemented efficiently on a large scale in hardware.
arXiv Detail & Related papers (2023-05-31T00:34:15Z) - Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Training Spiking Neural Networks with Local Tandem Learning [96.32026780517097]
Spiking neural networks (SNNs) are shown to be more biologically plausible and energy efficient than their predecessors.
In this paper, we put forward a generalized learning rule, termed Local Tandem Learning (LTL).
We demonstrate rapid network convergence within five training epochs on the CIFAR-10 dataset while having low computational complexity.
arXiv Detail & Related papers (2022-10-10T10:05:00Z) - Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model with better performance than training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z) - Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z) - Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification [10.727102755903616]
We aim for efficient deep BNNs amenable to complex computer vision architectures.
We achieve this by leveraging variational autoencoders (VAEs) to learn the interaction and the latent distribution of the parameters at each network layer.
Our approach, Latent-Posterior BNN (LP-BNN), is compatible with the recent BatchEnsemble method, leading to highly efficient (in terms of computation and memory during both training and testing) ensembles.
arXiv Detail & Related papers (2020-12-04T19:50:09Z) - Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks [78.65792427542672]
Dynamic Graph Network (DG-Net) is a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent connection paths.
Instead of using a fixed path through the network, DG-Net aggregates features dynamically at each node, which gives the network greater representational ability.
arXiv Detail & Related papers (2020-10-02T16:50:26Z) - Distributed Training of Deep Learning Models: A Taxonomic Perspective [11.924058430461216]
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the distributed resources of a cluster.
We aim to shine some light on the fundamental principles that are at work when training deep neural networks in a cluster of independent machines.
arXiv Detail & Related papers (2020-07-08T08:56:58Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)