Collaborative Multi-Teacher Knowledge Distillation for Learning Low
Bit-width Deep Neural Networks
- URL: http://arxiv.org/abs/2210.16103v1
- Date: Thu, 27 Oct 2022 01:03:39 GMT
- Title: Collaborative Multi-Teacher Knowledge Distillation for Learning Low
Bit-width Deep Neural Networks
- Authors: Cuong Pham, Tuan Hoang, Thanh-Toan Do
- Abstract summary: We propose a novel framework that leverages both multi-teacher knowledge distillation and network quantization for learning low bit-width DNNs.
Our experimental results on the CIFAR100 and ImageNet datasets show that the compact quantized student models trained with our method achieve competitive results.
- Score: 28.215073725175728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation, which learns a lightweight student model by distilling
knowledge from a cumbersome teacher model, is an attractive approach for learning compact
deep neural networks (DNNs). Recent works further improve student network performance by
leveraging multiple teacher networks. However, most existing knowledge distillation-based
multi-teacher methods use separately pretrained teachers, which limits collaborative
learning between teachers and mutual learning between teachers and the student. Network
quantization is another attractive approach for learning compact DNNs. However, most
existing network quantization methods are developed and evaluated without considering
multi-teacher support for enhancing the performance of the quantized student model. In this
paper, we propose a novel framework that leverages both multi-teacher knowledge
distillation and network quantization for learning low bit-width DNNs. The proposed method
encourages both collaborative learning between quantized teachers and mutual learning
between the quantized teachers and the quantized student. During the learning process, at
corresponding layers, knowledge from the teachers forms an importance-aware shared
knowledge, which is used both as input for the teachers at subsequent layers and to guide
the student. Our experimental results on the CIFAR100 and ImageNet datasets show that
compact quantized student models trained with our method achieve results competitive with
other state-of-the-art methods and, in some cases, even surpass the full-precision models.
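As a concrete illustration of the mechanism described in the abstract, here is a minimal PyTorch-style sketch. It is not the authors' implementation: it assumes a standard straight-through uniform quantiser and a single learnable, softmax-normalised importance score per teacher, and the names (uniform_quantize, ImportanceAwareFusion, layer_guidance_loss) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def uniform_quantize(x, bits=4):
    """Straight-through uniform quantiser on [0, 1].

    A common low bit-width quantisation choice; not necessarily the exact
    scheme used in the paper."""
    levels = 2 ** bits - 1
    x = torch.clamp(x, 0.0, 1.0)
    q = torch.round(x * levels) / levels
    # Straight-through estimator: forward uses q, backward uses the identity.
    return x + (q - x).detach()


class ImportanceAwareFusion(nn.Module):
    """Fuses same-layer features from several quantized teachers into one
    shared representation via learned, softmax-normalised importance weights."""

    def __init__(self, num_teachers):
        super().__init__()
        # One learnable importance score per teacher.
        self.scores = nn.Parameter(torch.zeros(num_teachers))

    def forward(self, teacher_feats):
        # teacher_feats: list of tensors, each of shape [B, C, H, W].
        w = F.softmax(self.scores, dim=0)                # importance weights
        stacked = torch.stack(teacher_feats, dim=0)      # [T, B, C, H, W]
        shared = (w.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)
        return shared                                    # importance-aware shared knowledge


def layer_guidance_loss(student_feat, shared_feat):
    """Pull the quantized student's feature map toward the shared knowledge."""
    return F.mse_loss(student_feat, shared_feat.detach())
```

In a training step of this sketch, the fused tensor would replace each teacher's own layer output as the input to its next layer, and the guidance loss would be added to the quantized student's task loss at every matched layer.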
Related papers
- Adaptive Multi-Teacher Knowledge Distillation with Meta-Learning [16.293262022872412]
We propose Adaptive Multi-teacher Knowledge Distillation with Meta-Learning (MMKD) to supervise the student with appropriate knowledge from a tailored ensemble teacher.
With the help of a meta-weight network, the diverse yet compatible teacher knowledge in the output layer and intermediate layers is jointly leveraged to enhance the student performance.
arXiv Detail & Related papers (2023-06-11T09:38:45Z)
- Knowledge Distillation via Weighted Ensemble of Teaching Assistants [18.593268785143426]
Knowledge distillation is the process of transferring knowledge from a large model called the teacher to a smaller model called the student.
When the network size gap between the teacher and student increases, the performance of the student network decreases.
We show that, by using multiple teaching assistant models, the student model (the smaller model) can be further improved.
arXiv Detail & Related papers (2022-06-23T22:50:05Z)
- Multi-Modality Distillation via Learning the teacher's modality-level Gram Matrix [3.4793807018498555]
It is necessary to force the student network to learn the modality relationship information of the teacher network.
To effectively transfer knowledge from teachers to students, a novel modality relation distillation paradigm is adopted that models the relationship information among different modalities.
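A modality-level relation loss of this kind can be sketched as follows. This is a generic Gram-matrix construction under assumed shapes, not the paper's code; modality_gram and relation_distillation_loss are hypothetical names.

```python
import torch
import torch.nn.functional as F


def modality_gram(feats):
    """feats: list of per-modality embeddings, each of shape [B, D].
    Returns a [B, M, M] Gram matrix of inner products between modalities."""
    stacked = torch.stack([F.normalize(f, dim=1) for f in feats], dim=1)  # [B, M, D]
    return stacked @ stacked.transpose(1, 2)                              # [B, M, M]


def relation_distillation_loss(student_feats, teacher_feats):
    # Match the student's modality-relation matrix to the teacher's.
    return F.mse_loss(modality_gram(student_feats),
                      modality_gram(teacher_feats).detach())
```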
arXiv Detail & Related papers (2021-12-21T18:53:58Z)
- Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For Model Compression [2.538209532048867]
Mutual Learning (ML) provides an alternative strategy where multiple simple student networks benefit from sharing knowledge.
We propose a single-teacher, multi-student framework that leverages both KD and ML to achieve better performance.
arXiv Detail & Related papers (2021-10-21T09:59:31Z)
- Iterative Teacher-Aware Learning [136.05341445369265]
In human pedagogy, teachers and students can interact adaptively to maximize communication efficiency.
We propose a gradient-optimization-based teacher-aware learner that can incorporate the teacher's cooperative intention into the likelihood function.
arXiv Detail & Related papers (2021-10-01T00:27:47Z)
- Adaptive Multi-Teacher Multi-level Knowledge Distillation [11.722728148523366]
We propose a novel adaptive multi-teacher multi-level knowledge distillation learning framework (AMTML-KD).
It consists of two novel insights: (i) associating each teacher with a latent representation to adaptively learn instance-level teacher importance weights.
As such, a student model can learn multi-level knowledge from multiple teachers through AMTML-KD.
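The instance-level weighting can be illustrated with a small sketch. The raw importance scores are assumed to come from the latent teacher representations mentioned above; the function names are hypothetical, not the paper's API.

```python
import torch
import torch.nn.functional as F


def weighted_soft_targets(teacher_logits, instance_scores, temp=4.0):
    """teacher_logits: list of [B, K] logit tensors, one per teacher.
    instance_scores: [B, num_teachers] raw importance scores per instance."""
    w = F.softmax(instance_scores, dim=1)                       # [B, num_teachers]
    probs = torch.stack([F.softmax(l / temp, dim=1)
                         for l in teacher_logits], dim=1)       # [B, num_teachers, K]
    return (w.unsqueeze(-1) * probs).sum(dim=1)                 # [B, K] ensemble soft target


def kd_loss(student_logits, soft_targets, temp=4.0):
    log_p = F.log_softmax(student_logits / temp, dim=1)
    return F.kl_div(log_p, soft_targets, reduction="batchmean") * (temp * temp)
```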
arXiv Detail & Related papers (2021-03-06T08:18:16Z)
- Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most existing methods, which rely on effective training of student models given pretrained teachers, we aim to learn teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
- Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition [79.75964372862279]
We propose Point Adversarial Self Mining (PASM) to improve the recognition accuracy in facial expression recognition.
PASM uses a point adversarial attack method and a trained teacher network to locate the most informative position related to the target task.
The adaptive generation of learning materials and the teacher/student updates can be conducted more than once, improving the network capability iteratively.
arXiv Detail & Related papers (2020-08-26T06:39:24Z)
- Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
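The swapping-in operation can be sketched as below, assuming the teacher and student are split into stage-aligned blocks with compatible feature shapes; swapped_forward and swap_prob are illustrative names only, not the paper's API.

```python
import random


def swapped_forward(x, student_blocks, teacher_blocks, swap_prob=0.3):
    """Run the student, occasionally 'swapping in' the matching (frozen)
    teacher block so it operates on the student's intermediate features."""
    for s_blk, t_blk in zip(student_blocks, teacher_blocks):
        if random.random() < swap_prob:
            x = t_blk(x)   # teacher block processes the student's features
        else:
            x = s_blk(x)
    return x
```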
arXiv Detail & Related papers (2020-07-03T03:22:04Z)
- Peer Collaborative Learning for Online Knowledge Distillation [69.29602103582782]
The Peer Collaborative Learning method integrates online ensembling and network collaboration into a unified framework.
Experiments on CIFAR-10, CIFAR-100 and ImageNet show that the proposed method significantly improves the generalisation of various backbone networks.
arXiv Detail & Related papers (2020-06-07T13:21:52Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)