Improving Streaming Automatic Speech Recognition With Non-Streaming
Model Distillation On Unsupervised Data
- URL: http://arxiv.org/abs/2010.12096v2
- Date: Sun, 21 Feb 2021 21:56:06 GMT
- Title: Improving Streaming Automatic Speech Recognition With Non-Streaming
Model Distillation On Unsupervised Data
- Authors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming
Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao
- Abstract summary: Streaming end-to-end automatic speech recognition models are widely used on smart speakers and on-device applications.
We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher.
We scale the training of streaming models to up to 3 million hours of YouTube audio.
- Score: 44.48235209327319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely
used on smart speakers and on-device applications. Since these models are
expected to transcribe speech with minimal latency, they are constrained to be
causal with no future context, compared to their non-streaming counterparts.
Consequently, streaming models usually perform worse than non-streaming models.
We propose a novel and effective learning method by leveraging a non-streaming
ASR model as a teacher to generate transcripts on an arbitrarily large data
set, which is then used to distill knowledge into streaming ASR models. This
way, we scale the training of streaming models to up to 3 million hours of
YouTube audio. Experiments show that our approach can significantly reduce the
word error rate (WER) of RNNT models not only on LibriSpeech but also on
YouTube data in four languages. For example, in French, we are able to reduce
the WER by 16.4% relative to a baseline streaming model by leveraging a
non-streaming teacher model trained on the same amount of labeled data as the
baseline.
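The recipe described in the abstract (a full-context, non-streaming teacher transcribes an arbitrarily large unlabeled corpus, and a causal streaming student is then trained on those transcripts) can be summarized in a minimal sketch. Everything below is illustrative scaffolding rather than the authors' implementation: the class and function names and the stand-in training step are assumptions.

```python
# Minimal sketch of non-streaming -> streaming pseudo-label distillation.
# All names here are illustrative placeholders, not the paper's code.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class PseudoLabeledExample:
    """One unlabeled utterance paired with the teacher's transcript."""
    audio: list          # raw waveform samples (placeholder representation)
    pseudo_transcript: str  # hypothesis produced by the non-streaming teacher


def generate_pseudo_labels(
    unlabeled_audio: Iterable[list],
    teacher_transcribe: Callable[[list], str],
) -> Iterator[PseudoLabeledExample]:
    """Step 1: the full-context (non-streaming) teacher transcribes raw audio."""
    for audio in unlabeled_audio:
        yield PseudoLabeledExample(audio, teacher_transcribe(audio))


def distill_into_streaming_student(
    examples: Iterable[PseudoLabeledExample],
    student_train_step: Callable[[list, str], float],
) -> float:
    """Step 2: train the causal streaming student on the teacher transcripts.

    `student_train_step` stands in for one optimizer update of, e.g., an
    RNN-T student model; here it just returns a loss value for monitoring.
    """
    total, count = 0.0, 0
    for ex in examples:
        total += student_train_step(ex.audio, ex.pseudo_transcript)
        count += 1
    return total / max(count, 1)


if __name__ == "__main__":
    # Tiny smoke test with dummy stand-ins for the teacher and the student.
    dummy_audio = [[0.0] * 16000, [0.1] * 16000]
    dummy_teacher = lambda audio: "hello world"
    dummy_student_step = lambda audio, text: float(len(text))
    batch = generate_pseudo_labels(dummy_audio, dummy_teacher)
    print("mean loss:", distill_into_streaming_student(batch, dummy_student_step))
```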
Related papers
- Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper [3.717584661565119]
We demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch without supervised data.
This allows a robust ASR model to be trained in a single stage without requiring large amounts of data or compute.
We validate the proposed framework on 6 languages from CommonVoice and propose multiple filters to remove hallucinated pseudo-labels (PLs); an illustrative sketch of such filters appears after this list.
arXiv Detail & Related papers (2024-09-20T13:38:59Z)
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
The models can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- Semi-Autoregressive Streaming ASR With Label Context [70.76222767090638]
We propose a streaming "semi-autoregressive" ASR model that incorporates the labels emitted in previous blocks as additional context.
Experiments show that our method outperforms the existing streaming NAR model by 19% relative on Tedlium2, 16%/8% on Librispeech-100 clean/other test sets, and 19%/8% on the Switchboard(SWB)/Callhome(CH) test sets.
arXiv Detail & Related papers (2023-09-19T20:55:58Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- Dual Learning for Large Vocabulary On-Device ASR [64.10124092250128]
Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once.
We provide an analysis of an on-device-sized streaming conformer trained on the entirety of Librispeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
arXiv Detail & Related papers (2023-01-11T06:32:28Z)
- Distributionally Robust Recurrent Decoders with Random Network Distillation [93.10261573696788]
We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to disregard OOD context during inference.
We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets.
arXiv Detail & Related papers (2021-10-25T19:26:29Z)
- Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models [34.002281923671795]
Streaming end-to-end automatic speech recognition systems are widely used in everyday applications that require transcribing speech to text in real-time.
Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER).
To improve streaming models, a recent study proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teacher's predictions.
In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER).
arXiv Detail & Related papers (2021-04-25T19:20:34Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
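To make the pseudo-label filtering mentioned in the Whisper distillation entry above more concrete, here is a minimal sketch of the kind of heuristics one might use to drop hallucinated pseudo-labels (PLs). The specific rules and thresholds (consecutive-token repetition, words-per-second bounds) are illustrative assumptions, not the filters proposed in that paper.

```python
# Illustrative pseudo-label (PL) filters; the rules and thresholds are
# assumptions for this sketch, not the filters from the cited paper.
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    audio_seconds: float  # utterance duration in seconds
    text: str             # transcript hypothesis produced by the teacher


def has_excessive_repetition(text: str, max_run: int = 4) -> bool:
    """Flag transcripts where a token repeats many times in a row,
    a common symptom of looping or hallucinated decoder output."""
    tokens = text.split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


def plausible_speaking_rate(pl: PseudoLabel,
                            min_wps: float = 0.5,
                            max_wps: float = 6.0) -> bool:
    """Keep only transcripts whose words-per-second rate is physically plausible."""
    if pl.audio_seconds <= 0:
        return False
    wps = len(pl.text.split()) / pl.audio_seconds
    return min_wps <= wps <= max_wps


def keep(pl: PseudoLabel) -> bool:
    """Composite filter: drop empty, repetitive, or implausible transcripts."""
    return (bool(pl.text.strip())
            and not has_excessive_repetition(pl.text)
            and plausible_speaking_rate(pl))


if __name__ == "__main__":
    candidates = [
        PseudoLabel(3.0, "the the the the the the the the"),               # looping output
        PseudoLabel(2.5, "please turn the lights off"),                     # plausible
        PseudoLabel(0.4, "far too many words for half a second of audio"),  # rate too high
    ]
    print([pl.text for pl in candidates if keep(pl)])  # -> ['please turn the lights off']
```

In a real pipeline, filters of this kind would sit between the teacher's transcription pass and student training, ahead of the distillation step sketched earlier.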
This list is automatically generated from the titles and abstracts of the papers on this site.