Transformer Based Deliberation for Two-Pass Speech Recognition
- URL: http://arxiv.org/abs/2101.11577v1
- Date: Wed, 27 Jan 2021 18:05:22 GMT
- Title: Transformer Based Deliberation for Two-Pass Speech Recognition
- Authors: Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman
- Abstract summary: Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
- Score: 46.86118010771703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactive speech recognition systems must generate words quickly while also
producing accurate results. Two-pass models excel at these requirements by
employing a first-pass decoder that quickly emits words, and a second-pass
decoder that requires more context but is more accurate. Previous work has
established that a deliberation network can be an effective second-pass model.
The model attends to two kinds of inputs at once: encoded audio frames and the
hypothesis text from the first-pass model. In this work, we explore using
transformer layers instead of long-short term memory (LSTM) layers for
deliberation rescoring. In transformer layers, we generalize the
"encoder-decoder" attention to attend to both encoded audio and first-pass text
hypotheses. The output context vectors are then combined by a merger layer.
Compared to LSTM-based deliberation, our best transformer deliberation achieves
7% relative word error rate improvements along with a 38% reduction in
computation. We also compare against non-deliberation transformer rescoring,
and find a 9% relative improvement.
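The generalized "encoder-decoder" attention lends itself to a compact sketch. Below is a minimal PyTorch sketch of one such deliberation decoder layer, assuming the merger layer is a concatenation of the two context vectors followed by a linear projection (the abstract only says the context vectors are "combined by a merger layer"); dimensions and the post-norm layout are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class DeliberationDecoderLayer(nn.Module):
    """One transformer deliberation decoder layer: the usual
    encoder-decoder attention is generalized into two cross-attentions,
    one over encoded audio frames and one over encoded first-pass
    hypothesis text, whose context vectors a merger layer combines."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hyp_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merger = nn.Linear(2 * d_model, d_model)  # assumed: concat + projection
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, audio_enc, hyp_enc, tgt_mask=None):
        # Causal self-attention over the second-pass output so far.
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # Attend to both input sources at once, then merge the contexts.
        audio_ctx, _ = self.audio_attn(x, audio_enc, audio_enc)
        hyp_ctx, _ = self.hyp_attn(x, hyp_enc, hyp_enc)
        x = self.norm2(x + self.merger(torch.cat([audio_ctx, hyp_ctx], dim=-1)))
        # Position-wise feed-forward block.
        return self.norm3(x + self.ffn(x))
```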
Related papers
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z) - Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition [13.542483062256109]
We present our Joint Audio/Text training method for Transformer Rescorer.
Our training method can improve word error rate (WER) significantly compared to standard Transformer Rescorer.
arXiv Detail & Related papers (2022-10-31T22:38:28Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations [22.40667024030858]
Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient.
Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance.
Trans-Encoder combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders.
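The bi-/cross-encoder distinction is easy to show in code. A hedged toy sketch, not Trans-Encoder's actual model or training loop; the mean pooling, dimensions, and scoring heads are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared toy encoder; vocabulary size and dimensions are illustrative.
VOCAB, D_MODEL = 1000, 128
embed = nn.Embedding(VOCAB, D_MODEL)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=2
)
cls_head = nn.Linear(D_MODEL, 1)


def bi_encoder_score(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Encode each sentence independently into a fixed vector (mean pooling)
    and compare by cosine similarity: representations can be precomputed."""
    va = encoder(embed(a)).mean(dim=1)
    vb = encoder(embed(b)).mean(dim=1)
    return F.cosine_similarity(va, vb)


def cross_encoder_score(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Encode the concatenated pair so self-attention spans both sentences,
    capturing inter-sentence interactions at higher per-pair cost."""
    joint = encoder(embed(torch.cat([a, b], dim=1)))
    return cls_head(joint.mean(dim=1)).squeeze(-1)
```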
arXiv Detail & Related papers (2021-09-27T14:06:47Z) - Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection [15.525314212209562]
We propose three latency reduction techniques for chunk-based incremental inference.
We show that our approach is also applicable to low-latency speech translation.
arXiv Detail & Related papers (2020-05-22T13:42:54Z) - On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
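A sketch of such a gating layer, assuming the standard hard-concrete relaxation commonly used for differentiable L0 regularization (Louizos et al., 2018); the paper's exact gate parameterization may differ:

```python
import math

import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    """Stochastic gate between encoder and decoder whose expected L0
    penalty (the probability each gate is nonzero) is differentiable,
    via the hard-concrete relaxation. Hyperparameters are assumptions."""

    def __init__(self, beta: float = 2 / 3, gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, log_alpha: torch.Tensor):
        # log_alpha: per-position gate logits, e.g. predicted from encoder states.
        if self.training:
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        else:
            s = torch.sigmoid(log_alpha)
        # Stretch to (gamma, zeta) and clip so gates can be exactly 0 or 1.
        gate = torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)
        # Expected L0: summed probability that each gate is nonzero.
        expected_l0 = torch.sigmoid(
            log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
        return gate, expected_l0
```

At training time, `expected_l0` is added to the task loss with a weight that trades sparsity against quality.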
arXiv Detail & Related papers (2020-04-24T16:57:52Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z) - Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
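The key streaming trick here, limiting the left context of self-attention, amounts to an attention mask. A minimal PyTorch sketch (the function name and mask convention are illustrative assumptions, not the paper's):

```python
import torch


def limited_left_context_mask(seq_len: int, left_context: int) -> torch.Tensor:
    """Boolean self-attention mask where position i may attend only to
    positions [i - left_context, i]. True marks blocked pairs, matching
    the attn_mask convention of torch.nn.MultiheadAttention."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j > i) | (j < i - left_context)


# Usage: attn(x, x, x, attn_mask=limited_left_context_mask(x.size(1), 64))
```

Bounding the lookback keeps per-frame attention cost constant, which is what makes streaming decoding computationally tractable.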
arXiv Detail & Related papers (2020-02-07T00:04:04Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)