Robust Knowledge Distillation from RNN-T Models With Noisy Training
Labels Using Full-Sum Loss
- URL: http://arxiv.org/abs/2303.05958v1
- Date: Fri, 10 Mar 2023 14:46:23 GMT
- Title: Robust Knowledge Distillation from RNN-T Models With Noisy Training
Labels Using Full-Sum Loss
- Authors: Mohammad Zeineldeen, Kartik Audhkhasi, Murali Karthick Baskar, Bhuvana
Ramabhadran
- Abstract summary: This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models.
We show that a sequence-level KD, full-sum distillation, outperforms other distillation methods for RNN-T models.
We also propose a variant of full-sum distillation that distills the sequence discriminative knowledge of the teacher leading to further improvement in WER.
- Score: 32.816725317261934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies knowledge distillation (KD) and addresses its constraints
for recurrent neural network transducer (RNN-T) models. In hard distillation, a
teacher model transcribes large amounts of unlabelled speech to train a student
model. Soft distillation is another popular KD method that distills the output
logits of the teacher model. Due to the nature of RNN-T alignments, applying
soft distillation between RNN-T architectures having different posterior
distributions is challenging. In addition, bad teachers with a high
word-error-rate (WER) reduce the efficacy of KD. We investigate how to
effectively distill knowledge from variable quality ASR teachers, which has not
been studied before to the best of our knowledge. We show that a sequence-level
KD, full-sum distillation, outperforms other distillation methods for RNN-T
models, especially for bad teachers. We also propose a variant of full-sum
distillation that distills the sequence discriminative knowledge of the teacher
leading to further improvement in WER. We conduct experiments on public
datasets, namely SpeechStew and LibriSpeech, and on in-house production data.
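
To make the distillation setups in the abstract concrete, below is a minimal PyTorch/torchaudio sketch of the two baseline flavours of RNN-T knowledge distillation: hard (sequence-level) distillation on teacher-decoded pseudo-labels via the RNN-T full-sum loss, and soft distillation as a KL divergence over the joiner lattice. The helper names, tensor shapes, and toy data are assumptions for illustration only; the paper's own full-sum distillation objective and its sequence-discriminative variant are not reproduced here.

```python
# Illustrative sketch only -- NOT the paper's implementation. Helper names,
# tensor shapes, and toy data are assumptions; the paper's full-sum
# distillation objective and its sequence-discriminative variant are only
# described at a high level in the abstract and are not reproduced here.
import torch
import torchaudio.functional as ta_F


def hard_distillation_loss(student_joiner_logits, teacher_transcripts,
                           logit_lengths, transcript_lengths, blank_id=0):
    """Hard / sequence-level distillation: train the student with the RNN-T
    full-sum loss on teacher-decoded pseudo-labels. The RNN-T loss already
    marginalises (sums) over all alignments of the transcript to the frames.

    student_joiner_logits: (B, T, U + 1, V) joiner outputs of the student.
    teacher_transcripts:   (B, U) int32 label ids decoded by the teacher.
    """
    return ta_F.rnnt_loss(
        logits=student_joiner_logits,
        targets=teacher_transcripts,
        logit_lengths=logit_lengths,
        target_lengths=transcript_lengths,
        blank=blank_id,
        reduction="mean",
    )


def soft_lattice_kd(student_joiner_logits, teacher_joiner_logits, temperature=1.0):
    """Soft distillation sketch: KL divergence between teacher and student
    posteriors on the (T, U + 1) joiner lattice. Both lattices must share the
    same shape, which is one reason soft distillation between different
    RNN-T architectures is difficult (see the abstract)."""
    log_p_teacher = torch.log_softmax(teacher_joiner_logits / temperature, dim=-1)
    log_p_student = torch.log_softmax(student_joiner_logits / temperature, dim=-1)
    return torch.nn.functional.kl_div(
        log_p_student, log_p_teacher, log_target=True, reduction="batchmean"
    )


# Toy tensors, just to show the expected layouts.
B, T, U, V = 2, 50, 10, 32                    # batch, frames, label length, vocab
student_logits = torch.randn(B, T, U + 1, V, requires_grad=True)
teacher_logits = torch.randn(B, T, U + 1, V)  # would come from the teacher's joiner
pseudo_labels = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lens = torch.full((B,), T, dtype=torch.int32)
label_lens = torch.full((B,), U, dtype=torch.int32)

loss = (
    hard_distillation_loss(student_logits, pseudo_labels, logit_lens, label_lens)
    + soft_lattice_kd(student_logits, teacher_logits)
)
loss.backward()
```

The hard-distillation helper relies on the fact that the RNN-T loss itself is a full-sum over alignments, while the soft sketch makes explicit why mismatched joiner-lattice shapes complicate logit-level distillation between different RNN-T architectures.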
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the features.
We propose a novel KD method, dubbed DiffKD, that explicitly denoises and matches features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find such a standard distillation paradigm would incur a serious bias issue -- popular items are more heavily recommended after the distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR [12.953149757081025]
We focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR).
We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student.
For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech using Noisy Student Training with soft target distillation.
arXiv Detail & Related papers (2022-10-11T21:32:34Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Dynamic Rectification Knowledge Distillation [0.0]
Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated cumbersome teacher model.
arXiv Detail & Related papers (2022-01-27T04:38:01Z)
- On Self-Distilling Graph Neural Network [64.00508355508106]
We propose the first teacher-free knowledge distillation method for GNNs, termed GNN Self-Distillation (GNN-SD).
The method is built upon the proposed neighborhood discrepancy rate (NDR), which quantifies the non-smoothness of the embedded graph in an efficient way.
We also summarize a generic GNN-SD framework that could be exploited to induce other distillation strategies.
arXiv Detail & Related papers (2020-11-04T12:29:33Z)