PQK: Model Compression via Pruning, Quantization, and Knowledge
Distillation
- URL: http://arxiv.org/abs/2106.14681v1
- Date: Fri, 25 Jun 2021 07:24:53 GMT
- Title: PQK: Model Compression via Pruning, Quantization, and Knowledge
Distillation
- Authors: Jangho Kim, Simyung Chang and Nojun Kwak
- Abstract summary: We propose a novel model compression method called PQK, consisting of pruning, quantization, and knowledge distillation processes.
PQK makes use of the unimportant weights pruned in the pruning process to build a teacher network for training a better student network, without pre-training the teacher model.
We apply our method to recognition models and verify the effectiveness of PQK on keyword spotting (KWS) and image recognition.
- Score: 43.45412122086056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As edge devices become prevalent, deploying Deep Neural Networks (DNNs) on them has become a critical issue. However, DNNs require high computational resources that are rarely available on edge devices. To handle this, we propose PQK, a novel model compression method for devices with limited computational resources that consists of pruning, quantization, and knowledge distillation (KD) processes. Unlike traditional pruning and KD, PQK makes use of the unimportant weights removed in the pruning process to build a teacher network for training a better student network, without pre-training the teacher model. PQK has two phases. Phase 1 exploits iterative pruning and quantization-aware training to produce a lightweight and power-efficient model. In phase 2, we build a teacher network by adding the unimportant weights unused in phase 1 back to the pruned network, and we then use this teacher to train the pruned network as a student. In doing so, we do not need a pre-trained teacher network for the KD framework, because the teacher and the student networks coexist within the same network. We apply our method to recognition models and verify the effectiveness of PQK on keyword spotting (KWS) and image recognition.
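
As a reading aid only, here is a minimal PyTorch-style sketch of the phase-2 idea described in the abstract: the student forward pass uses the pruned (important) weights, the teacher forward pass re-uses the same tensor with the unimportant weights added back, and the student is trained with a standard KD loss. All names and hyperparameters (e.g. `keep_ratio`, `T`, `alpha`) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

class PrunedLinear(torch.nn.Module):
    """Toy layer in which teacher and student coexist: the student uses only the
    important (unpruned) weights, while the teacher also uses the unimportant
    weights that phase 1 would have pruned away (sketch, not the PQK code)."""

    def __init__(self, in_features, out_features, keep_ratio=0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.bias = torch.nn.Parameter(torch.zeros(out_features))
        # Magnitude-based pruning mask: 1 for "important" weights, 0 otherwise.
        n = self.weight.numel()
        k = int(keep_ratio * n)
        threshold = self.weight.abs().flatten().kthvalue(n - k).values
        self.register_buffer("mask", (self.weight.abs() > threshold).float())

    def forward(self, x, as_teacher=False):
        # Student path: masked weights only; teacher path: full weight tensor.
        w = self.weight if as_teacher else self.weight * self.mask
        return F.linear(x, w, self.bias)

def phase2_step(layer, x, y, T=4.0, alpha=0.5):
    """One illustrative phase-2 update: cross-entropy on the student output plus a
    distillation term pulling the student toward the teacher's soft outputs."""
    student_logits = layer(x, as_teacher=False)
    with torch.no_grad():  # the teacher is derived from the same tensor, not pre-trained
        teacher_logits = layer(x, as_teacher=True)
    ce = F.cross_entropy(student_logits, y)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

# Toy usage on random data.
layer = PrunedLinear(16, 10)
x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))
loss = phase2_step(layer, x, y)
loss.backward()
```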
Related papers
- Adaptive Teaching with Shared Classifier for Knowledge Distillation [6.03477652126575]
Knowledge distillation (KD) is a technique used to transfer knowledge from a teacher network to a student network.
We propose adaptive teaching with a shared classifier (ATSC).
Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios.
arXiv Detail & Related papers (2024-06-12T08:51:08Z)
- BD-KD: Balancing the Divergences for Online Knowledge Distillation [12.27903419909491]
We propose BD-KD: Balancing of Divergences for online Knowledge Distillation.
We show that adaptively balancing the reverse and forward divergences shifts the focus of the training strategy to the compact student network.
We demonstrate that performing this balancing at the level of the student distillation loss improves both the accuracy and the calibration of the compact student network (a rough sketch of such a balanced loss is given after this list).
arXiv Detail & Related papers (2022-12-25T22:27:32Z)
- Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
arXiv Detail & Related papers (2022-09-30T15:15:05Z)
- Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms kernel methods.
arXiv Detail & Related papers (2022-05-30T02:51:36Z)
- Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core yields a low-rank model with superior performance compared to training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)
- Stochastic Precision Ensemble: Self-Knowledge Distillation for Quantized Deep Neural Networks [27.533162215182422]
Quantized deep neural networks (QDNNs) have been actively studied for deployment on edge devices.
Recent studies employ the knowledge distillation (KD) method to improve the performance of quantized networks.
In this study, we propose ensemble training for QDNNs (SPEQ).
arXiv Detail & Related papers (2020-09-30T08:38:37Z)
- HALO: Learning to Prune Neural Networks with Shrinkage [5.283963846188862]
Deep neural networks achieve state-of-the-art performance in a variety of tasks by extracting a rich set of features from unstructured data.
Modern techniques for inducing sparsity and reducing model size are (1) network pruning, (2) training with a sparsity inducing penalty, and (3) training a binary mask jointly with the weights of the network.
We present a novel penalty called the Hierarchical Adaptive Lasso (HALO), which learns to adaptively sparsify the weights of a given network via trainable parameters.
arXiv Detail & Related papers (2020-08-24T04:08:48Z)
- Adjoined Networks: A Training Paradigm with Applications to Network Compression [3.995047443480282]
We introduce Adjoined Networks, or AN, a learning paradigm that trains both the original base network and the smaller compressed network together.
Using ResNet-50 as the base network, AN achieves 71.8% top-1 accuracy with only 1.8M parameters and 1.6 GFLOPs on the ImageNet dataset.
We propose Differentiable Adjoined Networks (DAN), a training paradigm that augments AN by using neural architecture search to jointly learn both the width and the weights for each layer of the smaller network.
arXiv Detail & Related papers (2020-06-10T02:48:16Z)
- Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least a 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)
- A "Network Pruning Network" Approach to Deep Model Compression [62.68120664998911]
We present a filter pruning approach for deep model compression using a multitask network.
Our approach is based on learning a pruner network to prune a pre-trained target network.
The compressed model produced by our approach is generic and does not need any special hardware/software support.
arXiv Detail & Related papers (2020-01-15T20:38:23Z)
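
As a rough illustration of the divergence balancing mentioned in the BD-KD entry above, the sketch below combines forward and reverse KL terms in a single distillation loss. The fixed weight `beta` and temperature `T` are placeholders for whatever adaptive scheme the paper actually uses, so this is an assumption-laden reading, not the authors' method.

```python
import torch
import torch.nn.functional as F

def balanced_kd_loss(student_logits, teacher_logits, beta=0.5, T=2.0):
    """Weighted sum of forward KL(teacher || student) and reverse KL(student || teacher),
    computed on temperature-softened distributions (illustrative sketch only)."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / T, dim=-1)
    forward_kl = F.kl_div(log_p_s, log_p_t, log_target=True, reduction="batchmean")
    reverse_kl = F.kl_div(log_p_t, log_p_s, log_target=True, reduction="batchmean")
    return (beta * forward_kl + (1.0 - beta) * reverse_kl) * (T * T)

# Toy usage with random logits.
loss = balanced_kd_loss(torch.randn(4, 10), torch.randn(4, 10))
```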
This list is automatically generated from the titles and abstracts of the papers on this site.