Ensemble knowledge distillation of self-supervised speech models
- URL: http://arxiv.org/abs/2302.12757v1
- Date: Fri, 24 Feb 2023 17:15:39 GMT
- Title: Ensemble knowledge distillation of self-supervised speech models
- Authors: Kuan-Po Huang, Tzu-hsun Feng, Yu-Kuan Fu, Tsu-Yuan Hsu, Po-Chieh Yen,
Wei-Cheng Tseng, Kai-Wei Chang, Hung-yi Lee
- Abstract summary: Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
- Score: 84.69577440755457
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Distilled self-supervised models have shown competitive performance and
efficiency in recent years. However, there is little prior work on jointly
distilling multiple self-supervised speech models. In our work, we performed
Ensemble Knowledge Distillation (EKD) on various self-supervised speech models
such as HuBERT, RobustHuBERT, and WavLM. We applied two different aggregation
techniques, layerwise-average and layerwise-concatenation, to the
representations of different teacher models and found that the former was more
effective. On top of that, we proposed a multiple prediction head method for
student models to predict different layer outputs of multiple teacher models
simultaneously. The experimental results show that our method improves the
performance of the distilled models on four downstream speech processing tasks:
Phoneme Recognition, Speaker Identification, Emotion Recognition, and Automatic
Speech Recognition, in the hidden-set track of the SUPERB benchmark.
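The abstract names two ingredients: aggregating the teachers' layer representations (layerwise-average versus layerwise-concatenation) and giving the student one prediction head per teacher. Below is a minimal PyTorch-style sketch of one plausible reading of these ideas; the class names, dimensions, and L1 regression loss are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def layerwise_average(teacher_layers):
    # teacher_layers: list of (batch, time, dim) tensors, one chosen layer per
    # teacher; averaging yields a single distillation target.
    return torch.stack(teacher_layers, dim=0).mean(dim=0)

def layerwise_concatenation(teacher_layers):
    # Alternative aggregation: concatenate along the feature dimension, which
    # grows the target size with the number of teachers.
    return torch.cat(teacher_layers, dim=-1)

class MultiHeadStudent(nn.Module):
    # Student with one linear prediction head per teacher, so it can regress
    # onto several teachers' layer outputs simultaneously.
    def __init__(self, student_dim=384, teacher_dim=768, num_teachers=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(num_teachers)
        )

    def forward(self, student_hidden):
        # student_hidden: (batch, time, student_dim)
        return [head(student_hidden) for head in self.heads]

def multi_head_distill_loss(predictions, teacher_layers):
    # Average an L1 regression loss over the matched (head, teacher) pairs.
    losses = [F.l1_loss(p, t) for p, t in zip(predictions, teacher_layers)]
    return sum(losses) / len(losses)

In the layerwise-average variant a single head would regress onto the averaged target from layerwise_average, while the multiple-prediction-head variant keeps each teacher's target separate.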
Related papers
- Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models [7.632217365130212]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various machine learning (ML) tasks.
These models can produce hallucinations, particularly in domains with incomplete knowledge.
We introduce DualChecker, an innovative framework designed to mitigate hallucinations and improve the performance of both teacher and student models.
arXiv Detail & Related papers (2024-08-22T12:04:04Z)
- DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning [140.96990096377127]
We introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR).
DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network.
We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
arXiv Detail & Related papers (2023-05-17T07:23:46Z)
- Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning [13.057037169495594]
We propose a Multi-mode Online Knowledge Distillation method (MOKD) to boost self-supervised visual representation learning.
In MOKD, two different models learn collaboratively in a self-supervised manner.
In addition, MOKD outperforms existing SSL-KD methods for both the student and teacher models.
arXiv Detail & Related papers (2023-04-13T12:55:53Z)
- Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement [15.012694052674899]
We propose two novel ideas to improve self-supervised monocular depth estimation.
We use a parameter-optimized model, updated over the training epochs, as the teacher to provide additional supervision.
We leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets.
arXiv Detail & Related papers (2023-02-20T06:28:52Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Self-paced ensemble learning for speech and audio classification [19.39192082485334]
We propose a self-paced ensemble learning (SPEL) scheme in which models learn from each other over several iterations.
During the self-paced learning process, our ensemble also gains knowledge about the target domain.
Our empirical results indicate that SPEL significantly outperforms the baseline ensemble models.
arXiv Detail & Related papers (2021-03-22T16:34:06Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout the distillation process, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models; a minimal weighting sketch follows after this list.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
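The Reinforced Multi-Teacher Selection entry above argues that weighting teachers differentially per example can beat a fixed equal weight. The sketch below illustrates that idea with a simple learned softmax scorer and a temperature-scaled KL distillation loss; the scorer and loss form are illustrative assumptions, since the cited paper learns the selection policy with reinforcement learning.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherWeighter(nn.Module):
    # Predicts a per-example weight distribution over teachers from the
    # student's pooled hidden state.
    def __init__(self, student_dim=384, num_teachers=3):
        super().__init__()
        self.scorer = nn.Linear(student_dim, num_teachers)

    def forward(self, student_hidden):
        # student_hidden: (batch, time, student_dim) -> (batch, num_teachers)
        pooled = student_hidden.mean(dim=1)
        return torch.softmax(self.scorer(pooled), dim=-1)

def weighted_multi_teacher_kd(student_logits, teacher_logits_list, weights, T=2.0):
    # Temperature-scaled KL divergence to each teacher, weighted per example.
    # student_logits: (batch, classes); each teacher's logits: (batch, classes);
    # weights: (batch, num_teachers).
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    total = 0.0
    for t, teacher_logits in enumerate(teacher_logits_list):
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
        total = total + (weights[:, t] * kl).mean()
    return (T * T) * total

Setting weights to a constant 1 / num_teachers recovers the fixed, equal-weight baseline described in that entry.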
This list is automatically generated from the titles and abstracts of the papers in this site.