Comparison of Soft and Hard Target RNN-T Distillation for Large-scale
ASR
- URL: http://arxiv.org/abs/2210.05793v1
- Date: Tue, 11 Oct 2022 21:32:34 GMT
- Title: Comparison of Soft and Hard Target RNN-T Distillation for Large-scale
ASR
- Authors: Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman
- Abstract summary: We focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR).
We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student.
For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech using Noisy Student Training with soft target distillation.
- Score: 12.953149757081025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation is an effective machine learning technique to transfer
knowledge from a teacher model to a smaller student model, especially with
unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T
model, which is widely used in state-of-the-art (SoTA) automatic speech
recognition (ASR). Specifically, we compared soft and hard target distillation
for training large-scale RNN-T models on the public LibriSpeech/LibriLight
dataset (60k hours) and on our in-house data (600k hours). We found that
hard targets are more effective when the teacher and student have different
architectures, such as a large teacher and a small streaming student. On the other
hand, soft target distillation works better in self-training scenarios such as
iterative large teacher training. For a large model with 0.6B weights, we
achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative
improvement on dev-other) using Noisy Student Training with soft target
distillation. It also allows our production teacher to continuously adapt to
new data domains.
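As a rough illustration of the two distillation modes compared above, the sketch below contrasts a soft-target loss (matching the teacher's output distribution over the RNN-T lattice with a KL term) against a hard-target loss (training on teacher-decoded pseudo-labels with the ordinary RNN-T loss). This is a minimal sketch under assumed tooling (PyTorch + torchaudio) and assumed tensor shapes, not the paper's exact recipe; the temperature handling and variable names are illustrative.

```python
# Minimal sketch of soft- vs. hard-target RNN-T distillation losses.
# Assumption (not from the paper): joint-network logits have shape
# (batch, time, label_len + 1, vocab).
import torch.nn.functional as F
import torchaudio.functional as TAF


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft targets: KL divergence between teacher and student posteriors
    over the full RNN-T output lattice."""
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    s = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2


def hard_target_loss(student_logits, pseudo_labels, logit_lengths, label_lengths,
                     blank_id=0):
    """Hard targets: ordinary RNN-T loss against pseudo-labels decoded by the
    teacher (e.g., via beam search) on unlabeled audio."""
    return TAF.rnnt_loss(
        student_logits,        # (B, T, U + 1, V) joint-network logits
        pseudo_labels.int(),   # (B, U) teacher-decoded token ids
        logit_lengths.int(),   # (B,) valid encoder frames per utterance
        label_lengths.int(),   # (B,) valid pseudo-label tokens per utterance
        blank=blank_id,
        reduction="mean",
    )
```

Per the paper's findings, the hard-target form is the better fit when teacher and student architectures differ (e.g., a large teacher and a small streaming student), while the soft-target form works better for iterative self-training of large models.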
Related papers
- ReffAKD: Resource-efficient Autoencoder-based Knowledge Distillation [3.301728339780329]
We propose an innovative method to boost knowledge distillation efficiency without the need for a resource-heavy teacher model.
Specifically, we present an efficient way of generating soft labels, thereby eliminating the need for a large teacher model.
Our experiments on various datasets, including CIFAR-100, Tiny Imagenet, and Fashion MNIST, demonstrate the superior resource efficiency of our approach.
arXiv Detail & Related papers (2024-04-15T15:54:30Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less
Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z) - DisWOT: Student Architecture Search for Distillation WithOut Training [0.0]
We explore a novel training-free framework to search for the best student architectures for a given teacher.
Our work first shows empirically that the optimal model under vanilla training cannot be the winner in distillation.
Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces.
arXiv Detail & Related papers (2023-03-28T01:58:45Z) - Robust Knowledge Distillation from RNN-T Models With Noisy Training
Labels Using Full-Sum Loss [32.816725317261934]
This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models.
We show that a sequence-level KD, full-sum distillation, outperforms other distillation methods for RNN-T models.
We also propose a variant of full-sum distillation that distills the sequence discriminative knowledge of the teacher leading to further improvement in WER.
arXiv Detail & Related papers (2023-03-10T14:46:23Z) - EmbedDistill: A Geometric Knowledge Distillation for Information
Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model (a minimal sketch of this idea follows at the end of this list).
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - LEAD: Liberal Feature-based Distillation for Dense Retrieval [67.48820723639601]
Knowledge distillation is often used to transfer knowledge from a strong teacher model to a relatively weak student model.
Traditional methods include response-based methods and feature-based methods.
In this paper, we propose a liberal feature-based distillation method (LEAD).
arXiv Detail & Related papers (2022-12-10T06:30:54Z) - Sparse Distillation: Speeding Up Text Classification by Using Bigger
Models [49.8019791766848]
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
arXiv Detail & Related papers (2021-10-16T10:04:14Z) - Data Distillation for Text Classification [7.473576666437028]
Data distillation aims to distill the knowledge from a large training dataset down to a smaller and synthetic one.
We develop a novel data distillation method for text classification.
Impressively, distilled data amounting to only 0.1% of the original text data achieves approximately 90% of the original performance.
arXiv Detail & Related papers (2021-04-17T04:54:54Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the distillation process.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across student models of different parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
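Relating to the EmbedDistill entry above: one simple reading of "distilling the relative geometry among queries and documents" is to make the student reproduce the teacher's query-document similarity structure. The sketch below only illustrates that general idea; the function names and the choice of an MSE criterion are assumptions, not the paper's actual objective.

```python
# Illustrative sketch (not EmbedDistill's actual loss): match the student's
# query-document similarity matrix to the teacher's so that the relative
# geometry of the embedding space is preserved.
import torch.nn.functional as F


def geometry_distillation_loss(student_q, student_d, teacher_q, teacher_d):
    """MSE between teacher and student query-document score matrices.

    student_q / teacher_q: (num_queries, dim) query embeddings
    student_d / teacher_d: (num_docs, dim) document embeddings
    """
    student_scores = student_q @ student_d.T   # (num_queries, num_docs)
    teacher_scores = teacher_q @ teacher_d.T
    return F.mse_loss(student_scores, teacher_scores)
```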