Transformer Based Deliberation for Two-Pass Speech Recognition
- URL: http://arxiv.org/abs/2101.11577v1
- Date: Wed, 27 Jan 2021 18:05:22 GMT
- Title: Transformer Based Deliberation for Two-Pass Speech Recognition
- Authors: Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman
- Abstract summary: Speech recognition systems must generate words quickly while also producing accurate results.
Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate.
Previous work has established that a deliberation network can be an effective second-pass model.
- Score: 46.86118010771703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactive speech recognition systems must generate words quickly while also
producing accurate results. Two-pass models excel at these requirements by
employing a first-pass decoder that quickly emits words, and a second-pass
decoder that requires more context but is more accurate. Previous work has
established that a deliberation network can be an effective second-pass model.
The model attends to two kinds of inputs at once: encoded audio frames and the
hypothesis text from the first-pass model. In this work, we explore using
transformer layers instead of long-short term memory (LSTM) layers for
deliberation rescoring. In transformer layers, we generalize the
"encoder-decoder" attention to attend to both encoded audio and first-pass text
hypotheses. The output context vectors are then combined by a merger layer.
Compared to LSTM-based deliberation, our best transformer deliberation achieves
7% relative word error rate improvements along with a 38% reduction in
computation. We also compare against non-deliberation transformer rescoring,
and find a 9% relative improvement.
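The generalized "encoder-decoder" attention lends itself to a compact sketch. Below is a minimal PyTorch sketch of one such deliberation decoder layer, assuming the merger layer is a concatenation of the two context vectors followed by a linear projection (the abstract only says the context vectors are "combined by a merger layer"); dimensions and the post-norm layout are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class DeliberationDecoderLayer(nn.Module):
    """One transformer deliberation decoder layer: the usual
    encoder-decoder attention is generalized into two cross-attentions,
    one over encoded audio frames and one over encoded first-pass
    hypothesis text, whose context vectors a merger layer combines."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hyp_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merger = nn.Linear(2 * d_model, d_model)  # assumed: concat + projection
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, audio_enc, hyp_enc, tgt_mask=None):
        # Causal self-attention over the second-pass output so far.
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)[0])
        # Attend to both input sources at once, then merge the contexts.
        audio_ctx, _ = self.audio_attn(x, audio_enc, audio_enc)
        hyp_ctx, _ = self.hyp_attn(x, hyp_enc, hyp_enc)
        x = self.norm2(x + self.merger(torch.cat([audio_ctx, hyp_ctx], dim=-1)))
        # Position-wise feed-forward block.
        return self.norm3(x + self.ffn(x))
```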
Related papers
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z) - Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition [13.542483062256109]
We present our Joint Audio/Text training method for Transformer Rescorer.
Our training method can improve word error rate (WER) significantly compared to standard Transformer Rescorer.
arXiv Detail & Related papers (2022-10-31T22:38:28Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations [22.40667024030858]
Bi-encoders produce fixed-dimensional sentence representations and are computationally efficient.
Cross-encoders can leverage their attention heads to exploit inter-sentence interactions for better performance.
Trans-Encoder combines the two learning paradigms into an iterative joint framework to simultaneously learn enhanced bi- and cross-encoders.
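The bi-/cross-encoder distinction is easy to show in code. A hedged toy sketch, not Trans-Encoder's actual model or training loop; the mean pooling, dimensions, and scoring heads are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared toy encoder; vocabulary size and dimensions are illustrative.
VOCAB, D_MODEL = 1000, 128
embed = nn.Embedding(VOCAB, D_MODEL)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=2
)
cls_head = nn.Linear(D_MODEL, 1)


def bi_encoder_score(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Encode each sentence independently into a fixed vector (mean pooling)
    and compare by cosine similarity: representations can be precomputed."""
    va = encoder(embed(a)).mean(dim=1)
    vb = encoder(embed(b)).mean(dim=1)
    return F.cosine_similarity(va, vb)


def cross_encoder_score(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Encode the concatenated pair so self-attention spans both sentences,
    capturing inter-sentence interactions at higher per-pair cost."""
    joint = encoder(embed(torch.cat([a, b], dim=1)))
    return cls_head(joint.mean(dim=1)).squeeze(-1)
```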
arXiv Detail & Related papers (2021-09-27T14:06:47Z) - Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection [15.525314212209562]
We propose three latency reduction techniques for chunk-based incremental inference.
We show that our approach is also applicable to low-latency speech translation.
arXiv Detail & Related papers (2020-05-22T13:42:54Z) - On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
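A sketch of such a gating layer, assuming the standard hard-concrete relaxation commonly used for differentiable L0 regularization (Louizos et al., 2018); the paper's exact gate parameterization may differ:

```python
import math

import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    """Stochastic gate between encoder and decoder whose expected L0
    penalty (the probability each gate is nonzero) is differentiable,
    via the hard-concrete relaxation. Hyperparameters are assumptions."""

    def __init__(self, beta: float = 2 / 3, gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, log_alpha: torch.Tensor):
        # log_alpha: per-position gate logits, e.g. predicted from encoder states.
        if self.training:
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        else:
            s = torch.sigmoid(log_alpha)
        # Stretch to (gamma, zeta) and clip so gates can be exactly 0 or 1.
        gate = torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)
        # Expected L0: summed probability that each gate is nonzero.
        expected_l0 = torch.sigmoid(
            log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()
        return gate, expected_l0
```

At training time, `expected_l0` is added to the task loss with a weight that trades sparsity against quality.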
arXiv Detail & Related papers (2020-04-24T16:57:52Z) - Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z) - Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss [14.755108017449295]
We present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system.
Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently.
We present results on the LibriSpeech dataset showing that limiting the left context for self-attention makes decoding computationally tractable for streaming.
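The key streaming trick here, limiting the left context of self-attention, amounts to an attention mask. A minimal PyTorch sketch (the function name and mask convention are illustrative assumptions, not the paper's):

```python
import torch


def limited_left_context_mask(seq_len: int, left_context: int) -> torch.Tensor:
    """Boolean self-attention mask where position i may attend only to
    positions [i - left_context, i]. True marks blocked pairs, matching
    the attn_mask convention of torch.nn.MultiheadAttention."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j > i) | (j < i - left_context)


# Usage: attn(x, x, x, attn_mask=limited_left_context_mask(x.size(1), 64))
```

Bounding the lookback keeps per-frame attention cost constant, which is what makes streaming decoding computationally tractable.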
arXiv Detail & Related papers (2020-02-07T00:04:04Z) - Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)