Bridging the gap between streaming and non-streaming ASR systems
by distilling ensembles of CTC and RNN-T models
- URL: http://arxiv.org/abs/2104.14346v1
- Date: Sun, 25 Apr 2021 19:20:34 GMT
- Title: Bridging the gap between streaming and non-streaming ASR systems
by distilling ensembles of CTC and RNN-T models
- Authors: Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier
Siohan, Liangliang Cao
- Abstract summary: Streaming end-to-end automatic speech recognition systems are widely used in everyday applications that require transcribing speech to text in real-time.
Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER)
To improve streaming models, a recent study proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teacher's predictions.
In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER).
- Score: 34.002281923671795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely
used in everyday applications that require transcribing speech to text in
real-time. Their minimal latency makes them suitable for such tasks. Unlike
their non-streaming counterparts, streaming models are constrained to be causal
with no future context and suffer from higher word error rates (WER). To
improve streaming models, a recent study [1] proposed to distill a
non-streaming teacher model on unsupervised utterances, and then train a
streaming student using the teacher's predictions. However, the performance gap
between teacher and student WERs remains high. In this paper, we aim to close
this gap by using a diversified set of non-streaming teacher models and
combining them using Recognizer Output Voting Error Reduction (ROVER). In
particular, we show that, despite being weaker than RNN-T models, CTC models
are remarkable teachers. Further, by fusing RNN-T and CTC models together, we
build the strongest teachers. The resulting student models drastically improve
upon streaming models of previous work [1]: the WER decreases by 41% on
Spanish, 27% on Portuguese, and 13% on French.
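ROVER combines the hypotheses of several recognizers by aligning them into a word transition network and taking a word-level vote. The following sketch illustrates only the voting stage, under the simplifying assumption that the hypotheses have already been aligned to equal length (real ROVER performs iterative dynamic-programming alignment, which is omitted here; all names are illustrative):

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Combine already-aligned hypotheses by per-slot majority vote.

    A minimal sketch of ROVER's voting stage: each hypothesis is a list
    of tokens of equal length, with '' marking a deletion in the
    alignment. The real system also weights votes by word confidence.
    """
    combined = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:  # drop slots where the majority voted for a deletion
            combined.append(word)
    return combined

# Three hypothetical teacher hypotheses for the same utterance:
hyps = [
    ["the", "cat", "sat", ""],
    ["the", "cat", "sat", "down"],
    ["a",   "cat", "sat", "down"],
]
print(rover_vote(hyps))  # -> ['the', 'cat', 'sat', 'down']
```

The combined transcript can then serve as the pseudo-label for training the streaming student on unsupervised audio.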
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
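For context, the classic soft-label distillation objective that methods like OKD build on matches the student's output distribution to a temperature-softened teacher distribution. This is a generic Hinton-style sketch, not the OKD objective itself:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label KD loss: cross-entropy of the student against the
    temperature-softened teacher distribution, scaled by T^2 so the
    gradient magnitude is roughly temperature-independent.

    A generic sketch of knowledge distillation, not the paper's method.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s)) * T * T
```

The loss is minimized when the student's softened distribution matches the teacher's; in practice it is mixed with the hard-label loss on transcribed data.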
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Semi-Autoregressive Streaming ASR With Label Context
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer [14.011579203058574]
Streaming automatic speech recognition (ASR) models are restricted from accessing future context.
Knowledge distillation (KD) from the non-streaming to streaming model has been studied.
We propose a layer-to-layer KD from the teacher encoder to the student encoder.
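A layer-to-layer objective of this kind penalizes the distance between matched intermediate encoder outputs of teacher and student. A minimal sketch, using plain lists for clarity (actual systems use framework tensors and may insert a learned projection when teacher and student dimensions differ):

```python
def layer_l2_loss(teacher_feats, student_feats):
    """Mean squared error between matched per-layer feature vectors.

    Sketch of a layer-to-layer distillation objective: each element of
    the inputs is one encoder layer's output vector; layers are matched
    positionally via zip. Names here are illustrative, not the paper's.
    """
    total, n = 0.0, 0
    for t_layer, s_layer in zip(teacher_feats, student_feats):
        for t, s in zip(t_layer, s_layer):
            total += (t - s) ** 2
            n += 1
    return total / n
```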
arXiv Detail & Related papers (2023-08-31T02:58:33Z)
- Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data [44.48235209327319]
Streaming end-to-end automatic speech recognition models are widely used on smart speakers and on-device applications.
We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher.
We scale the training of streaming models to up to 3 million hours of YouTube audio.
arXiv Detail & Related papers (2020-10-22T22:41:33Z)
- RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models.
We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
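Overlapping inference of this sort decodes long audio in windows that share frames with their neighbours, so the model never loses all context at a chunk boundary. A rough sketch of the windowing alone (the merging of per-chunk hypotheses is a separate step, and the parameter names are hypothetical, not from the paper):

```python
def overlapping_chunks(frames, chunk=8, overlap=2):
    """Split a frame sequence into overlapping chunks for inference.

    Each chunk shares `overlap` frames with its successor; the stride
    between chunk starts is therefore chunk - overlap. An illustrative
    sketch of overlapping-window decoding, not the paper's algorithm.
    """
    step = chunk - overlap
    chunks = []
    for start in range(0, max(len(frames) - overlap, 1), step):
        chunks.append(frames[start:start + chunk])
    return chunks
```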
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.