Deliberation Model Based Two-Pass End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2003.07962v1
- Date: Tue, 17 Mar 2020 22:01:12 GMT
- Title: Deliberation Model Based Two-Pass End-to-End Speech Recognition
- Authors: Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar
- Abstract summary: A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
- Score: 52.45841282906516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) models have made rapid progress in automatic speech
recognition (ASR) and perform competitively relative to conventional models. To
further improve the quality, a two-pass model has been proposed to rescore
streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS)
model while maintaining a reasonable latency. The model attends to acoustics to
rescore hypotheses, as opposed to a class of neural correction models that use
only first-pass text hypotheses. In this work, we propose to attend to both
acoustics and first-pass hypotheses using a deliberation network. A
bidirectional encoder is used to extract context information from first-pass
hypotheses. The proposed deliberation model achieves 12% relative WER reduction
compared to LAS rescoring in Google Voice Search (VS) tasks, and 23% reduction
on a proper noun test set. Compared to a large conventional model, our best
model performs 21% relatively better for VS. In terms of computational
complexity, the deliberation decoder has a larger size than the LAS decoder,
and hence requires more computations in second-pass decoding.
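As background for the second-pass rescoring idea the abstract builds on, here is a minimal pure-Python sketch of choosing among first-pass hypotheses by interpolating first-pass and second-pass log-scores. The `rescore` function, the toy n-best list, the score table, and the weight `lam` are illustrative assumptions only; the paper's deliberation model instead computes second-pass scores with an attention decoder over both the acoustics and the first-pass hypotheses.

```python
# Sketch of two-pass rescoring: a streaming first pass emits an n-best
# list with log-probabilities; a (here hypothetical) second-pass scorer
# re-scores each hypothesis, and the two scores are interpolated.

def rescore(nbest, second_pass_score, lam=0.5):
    """Return the hypothesis maximizing an interpolation of the
    first-pass log-probability and a second-pass score."""
    best_hyp, best_score = None, float("-inf")
    for hyp, first_pass_logp in nbest:
        combined = (1 - lam) * first_pass_logp + lam * second_pass_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp

# Toy usage: the second pass flips the first-pass ranking.
nbest = [("call mom", -1.2), ("call tom", -1.0)]
second = {"call mom": -0.5, "call tom": -2.0}.get
print(rescore(nbest, second, lam=0.5))  # prints: call mom
```

With `lam=0` the first-pass best ("call tom") is kept; with equal weighting the second pass overturns it, which is the quality gain two-pass models trade extra second-pass computation for.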
Related papers
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode [1.7704011486040847]
Non-streaming models provide better performance as they look at the entire audio context.
We show that the Transformer model offers acceptable WER with the lowest latency requirements.
We highlight the importance of a CNN front-end with the Transformer architecture to achieve comparable word error rates (WER).
arXiv Detail & Related papers (2022-06-26T09:12:27Z)
- A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z)
- Transformer Based Deliberation for Two-Pass Speech Recognition [46.86118010771703]
Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
arXiv Detail & Related papers (2021-01-27T18:05:22Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings [42.17790794610591]
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification.
The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers.
It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training.
arXiv Detail & Related papers (2021-01-06T03:36:09Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
- Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition [36.86458309520383]
The two-pass model provides better speed-quality trade-offs for on-device speech recognition.
The second-pass model plays a key role in improving the quality of the end-to-end model beyond that of the conventional model.
In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequences in parallel.
arXiv Detail & Related papers (2020-08-30T05:17:31Z)
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.