Knowledge Distillation from Non-streaming to Streaming ASR Encoder using
Auxiliary Non-streaming Layer
- URL: http://arxiv.org/abs/2308.16415v1
- Date: Thu, 31 Aug 2023 02:58:33 GMT
- Title: Knowledge Distillation from Non-streaming to Streaming ASR Encoder using
Auxiliary Non-streaming Layer
- Authors: Kyuhong Shim, Jinkyu Lee, Simyung Chang, Kyuwoong Hwang
- Abstract summary: Streaming automatic speech recognition (ASR) models are restricted from accessing future context.
Knowledge distillation (KD) from the non-streaming to streaming model has been studied.
We propose a layer-to-layer KD from the teacher encoder to the student encoder.
- Score: 14.011579203058574
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Streaming automatic speech recognition (ASR) models are restricted from
accessing future context, which results in worse performance compared to the
non-streaming models. To improve the performance of streaming ASR, knowledge
distillation (KD) from the non-streaming to streaming model has been studied,
mainly focusing on aligning the output token probabilities. In this paper, we
propose a layer-to-layer KD from the teacher encoder to the student encoder. To
ensure that features are extracted using the same context, we insert auxiliary
non-streaming branches into the student and perform KD from the non-streaming
teacher layer to the non-streaming auxiliary layer. We design a special KD loss
that leverages the autoregressive predictive coding (APC) mechanism to
encourage the streaming model to predict unseen future contexts. Experimental
results show that the proposed method can significantly reduce the word error
rate compared to previous token probability distillation methods.
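As a rough sketch of the encoder-side setup (a minimal illustration under stated assumptions, not the authors' implementation: the single layer, module sizes, causal-mask attention, and the plain L1 feature loss are all placeholders), the streaming student layer runs under a causal mask, while an auxiliary full-context branch on top of it is matched to the corresponding non-streaming teacher layer:

```python
# Hedged sketch of layer-to-layer KD through an auxiliary non-streaming branch.
import torch
import torch.nn as nn

class StudentLayerWithAuxBranch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # The streaming path is restricted to past/current frames by a causal mask.
        self.streaming_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # The auxiliary branch attends over the full utterance, like the teacher.
        self.aux_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        stream_out, _ = self.streaming_attn(x, x, x, attn_mask=causal_mask)
        # Full-context pass over streaming features: to match the teacher, the
        # streaming features must carry information about unseen future frames,
        # which is the APC-like pressure the abstract describes.
        aux_out, _ = self.aux_attn(stream_out, stream_out, stream_out)
        return stream_out, aux_out

# Toy usage with random tensors standing in for real activations.
x = torch.randn(2, 50, 256)             # (batch, frames, dim)
teacher_feat = torch.randn(2, 50, 256)  # frozen non-streaming teacher layer output
stream_out, aux_out = StudentLayerWithAuxBranch()(x)
kd_loss = nn.functional.l1_loss(aux_out, teacher_feat)  # layer-to-layer KD loss
kd_loss.backward()
```

Because the auxiliary branch exists only to receive the distillation signal, it can presumably be dropped at inference time, leaving the streaming path unchanged.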
Related papers
- Sample what you can't compress [6.24979299238534]
We show how to learn a continuous encoder and decoder under a diffusion-based loss.
This approach yields better reconstruction quality than GAN-based autoencoders.
We also show that the resulting representation is easier to model with a latent diffusion model than one obtained from a state-of-the-art GAN-based loss.
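A toy version of such a diffusion-based reconstruction objective (the linear modules, flattened 28x28 inputs, and linear noise schedule are assumptions, not the paper's architecture): the decoder is trained to predict the noise added to the input, conditioned on the encoder's continuous latent.

```python
# Toy diffusion-style reconstruction loss for a continuous autoencoder.
import torch
import torch.nn as nn

encoder = nn.Linear(28 * 28, 64)                  # continuous encoder
denoiser = nn.Linear(64 + 28 * 28 + 1, 28 * 28)   # decoder as conditional denoiser

def diffusion_recon_loss(x):
    z = encoder(x)                     # continuous latent code
    t = torch.rand(x.size(0), 1)       # random noise level per example
    eps = torch.randn_like(x)
    x_noisy = (1 - t) * x + t * eps    # simple linear noising schedule
    pred_eps = denoiser(torch.cat([z, x_noisy, t], dim=1))
    return nn.functional.mse_loss(pred_eps, eps)  # denoising objective

x = torch.rand(8, 28 * 28)             # stand-in batch of flattened images
diffusion_recon_loss(x).backward()
```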
arXiv Detail & Related papers (2024-09-04T08:42:42Z)
- SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer [102.39050180060913]
Diffusion Transformer (DiT) has emerged as the new trend in generative diffusion models for image generation.
Recent breakthroughs have been driven by a mask strategy that significantly improves the training efficiency of DiT with additional intra-image contextual learning.
In this work, we address the limitations of that strategy by unleashing self-supervised discrimination knowledge to boost DiT training.
arXiv Detail & Related papers (2024-03-25T17:59:35Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard (SWB)/Callhome (CH) test sets.
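A loose illustration of the block-wise mechanism (a GRU and greedy argmax stand in for the paper's attention model and non-autoregressive decoder; all sizes are toy values): tokens emitted in earlier blocks are embedded and prepended as context when decoding the current block.

```python
# Toy block-wise streaming decoding with label context from previous blocks.
import torch
import torch.nn as nn

vocab, dim = 100, 64
embed = nn.Embedding(vocab, dim)
decoder = nn.GRU(dim, dim, batch_first=True)
out_proj = nn.Linear(dim, vocab)

emitted = [torch.tensor([[1]])]                  # start token (batch of 1)
for block_feats in torch.randn(3, 1, 10, dim):   # three 10-frame audio blocks
    label_ctx = embed(torch.cat(emitted, dim=1)) # labels from earlier blocks
    states, _ = decoder(torch.cat([label_ctx, block_feats], dim=1))
    logits = out_proj(states[:, -block_feats.size(1):])
    emitted.append(logits.argmax(-1))            # greedy, parallel within the block
```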
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
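A heavily simplified sketch of this denoise-then-distill recipe (a one-step denoiser and MSE losses stand in for the paper's actual diffusion model; all names are placeholders):

```python
# Simplified DiffKD-style feature distillation sketch.
import torch
import torch.nn as nn

denoiser = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

def diffkd_losses(student_feat, teacher_feat):
    # 1) Teach the denoiser what clean teacher features look like.
    noisy_teacher = teacher_feat + torch.randn_like(teacher_feat)
    denoise_loss = nn.functional.mse_loss(denoiser(noisy_teacher), teacher_feat)
    # 2) Treat student features as noisy teacher features: denoise, then match.
    kd_loss = nn.functional.mse_loss(denoiser(student_feat), teacher_feat)
    return denoise_loss + kd_loss

loss = diffkd_losses(torch.randn(4, 256), torch.randn(4, 256))
loss.backward()
```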
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linearly-separable representations within its intermediate layers without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively.
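Linear evaluation means fitting only a linear classifier on frozen features; a toy version (random tensors stand in for real DDAE activations and dataset labels):

```python
# Toy linear probe on frozen intermediate features.
import torch
import torch.nn as nn

probe = nn.Linear(512, 10)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1)
features = torch.randn(64, 512)        # frozen mid-layer activations (stand-in)
labels = torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(probe(features), labels)
loss.backward()
optimizer.step()
```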
arXiv Detail & Related papers (2023-03-17T04:20:47Z)
- Streaming Align-Refine for Non-autoregressive Deliberation [42.748839817396046]
We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model.
Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context.
Experiments on voice search datasets and Librispeech show that with reasonable right context, our streaming model performs as well as the offline counterpart.
arXiv Detail & Related papers (2022-04-15T17:24:39Z)
- EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called EvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
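For contrast with the feature-level KD above, a standard soft-label distillation loss of the kind used in such cross-modal setups (the temperature, class count, and shapes are illustrative, not the paper's exact objective):

```python
# Hinton-style soft-label KD: event-based student mimics a frame-based teacher.
import torch
import torch.nn.functional as F

def cross_modal_kd(student_logits, teacher_logits, temperature=2.0):
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Temperature-squared scaling keeps gradient magnitudes comparable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

loss = cross_modal_kd(torch.randn(4, 19), torch.randn(4, 19))
```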
arXiv Detail & Related papers (2021-11-24T08:48:16Z)
- An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR [19.668440671541546]
An attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system.
The proposed method achieves higher accuracy with lower latency than the conventional triggered attention-based streaming ASR system.
arXiv Detail & Related papers (2021-10-20T06:44:58Z)
- Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models [34.002281923671795]
Streaming end-to-end automatic speech recognition systems are widely used in everyday applications that require transcribing speech to text in real-time.
Unlike their non-streaming counterparts, streaming models are constrained to be causal, with no future context, and suffer from higher word error rates (WER).
To improve streaming models, a recent study proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teacher's predictions.
In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER).
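ROVER aligns competing hypotheses and votes on each slot; a crude stand-in that skips the alignment step and votes per position over equal-length hypotheses:

```python
# Crude majority-vote stand-in for ROVER (real ROVER first aligns
# hypotheses of different lengths; alignment is omitted here for brevity).
from collections import Counter

def majority_vote(hypotheses):
    # hypotheses: equal-length token lists from several teacher models.
    return [Counter(slot).most_common(1)[0][0] for slot in zip(*hypotheses)]

teachers = [["the", "cat", "sat"], ["the", "cat", "sap"], ["a", "cat", "sat"]]
print(majority_vote(teachers))  # ['the', 'cat', 'sat'] becomes the pseudo-label
```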
arXiv Detail & Related papers (2021-04-25T19:20:34Z)
- Cascaded encoders for unifying streaming and non-streaming ASR [68.62941009369125]
This work presents cascaded encoders for building a single E2E ASR model that can operate in both these modes simultaneously.
A single decoder then learns to decode either using the output of the streaming or the non-streaming encoder.
Results show that this model achieves word error rates (WER) similar to those of a standalone streaming model when operating in streaming mode, and obtains a 10%-27% relative improvement when operating in non-streaming mode.
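A minimal sketch of the cascaded layout (the GRU, attention layer, and linear projection are stand-ins for the real streaming/non-streaming encoder stacks and the shared decoder): a causal encoder feeds an extra full-context encoder, and one decoder reads from either output depending on the operating mode.

```python
# Toy cascaded encoders: one model, two operating modes.
import torch
import torch.nn as nn

class CascadedEncoders(nn.Module):
    def __init__(self, dim=256, vocab=100):
        super().__init__()
        self.streaming_enc = nn.GRU(dim, dim, batch_first=True)             # causal
        self.cascade_enc = nn.MultiheadAttention(dim, 4, batch_first=True)  # full context
        self.decoder = nn.Linear(dim, vocab)   # shared by both modes

    def forward(self, feats, streaming=True):
        s, _ = self.streaming_enc(feats)
        if streaming:
            return self.decoder(s)             # low-latency path
        c, _ = self.cascade_enc(s, s, s)       # extra pass with full context
        return self.decoder(c)                 # higher-accuracy path

model = CascadedEncoders()
logits = model(torch.randn(2, 30, 256), streaming=False)
```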
arXiv Detail & Related papers (2020-10-27T20:59:50Z)
- Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data [44.48235209327319]
Streaming end-to-end automatic speech recognition models are widely used on smart speakers and on-device applications.
We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher.
We scale the training of streaming models to up to 3 million hours of YouTube audio.
arXiv Detail & Related papers (2020-10-22T22:41:33Z)