Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates
- URL: http://arxiv.org/abs/2109.12804v1
- Date: Mon, 27 Sep 2021 05:21:30 GMT
- Title: Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates
- Authors: Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, Shinji Watanabe
- Abstract summary: We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
- Score: 59.678108707409606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multi-decoder (MD) end-to-end speech translation model has demonstrated
high translation quality by searching for better intermediate automatic speech
recognition (ASR) decoder states as hidden intermediates (HI). It is a two-pass
decoding model decomposing the overall task into ASR and machine translation
sub-tasks. However, the decoding speed is not fast enough for real-world
applications because it conducts beam search for both sub-tasks during
inference. We propose Fast-MD, a fast MD model that generates HI by
non-autoregressive (NAR) decoding based on connectionist temporal
classification (CTC) outputs followed by an ASR decoder. We investigated two
types of NAR HI: (1) parallel HI by using an autoregressive Transformer ASR
decoder and (2) masked HI by using Mask-CTC, which combines CTC and the
conditional masked language model. To reduce a mismatch in the ASR decoder
between teacher-forcing during training and conditioning on CTC outputs during
testing, we also propose sampling CTC outputs during training. Experimental
evaluations on three corpora show that Fast-MD achieved about 2x and 4x faster
decoding speed than that of the naïve MD model on GPU and CPU with comparable
translation quality. Adopting the Conformer encoder and intermediate CTC loss
further boosts its quality without sacrificing decoding speed.
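To make the two-pass idea concrete, below is a minimal PyTorch sketch of parallel-HI generation; the module names, sizes, and greedy CTC collapse are illustrative assumptions, not the authors' implementation. The key point is that the AR ASR decoder runs exactly once, teacher-forced on the CTC hypothesis, so all positions of the hidden intermediates are computed in parallel and no ASR beam search is needed before the MT decoder.

```python
import torch
import torch.nn as nn

# Minimal sketch of parallel-HI generation (illustrative, not the authors'
# code): a CTC head proposes an ASR hypothesis in one pass, then the AR ASR
# decoder is run once in teacher-forcing mode on that hypothesis, so the
# hidden intermediates (HI) are obtained without any beam search.

class ParallelHIGenerator(nn.Module):
    def __init__(self, vocab=1000, d_model=256, blank=0):
        super().__init__()
        self.blank = blank
        self.ctc_head = nn.Linear(d_model, vocab)
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.asr_decoder = nn.TransformerDecoder(layer, num_layers=2)

    def ctc_greedy(self, enc):                       # enc: (B, T, d_model)
        ids = self.ctc_head(enc).argmax(-1)          # frame-level argmax
        seqs = []
        for row in ids:                              # collapse repeats, drop blanks
            out, prev = [], self.blank
            for t in row.tolist():
                if t != prev and t != self.blank:
                    out.append(t)
                prev = t
            seqs.append(torch.tensor(out or [self.blank]))
        return nn.utils.rnn.pad_sequence(seqs, batch_first=True)

    def forward(self, enc):
        hyp = self.ctc_greedy(enc)                   # (B, U) CTC hypothesis
        U = hyp.size(1)
        causal = torch.triu(torch.full((U, U), float("-inf")), diagonal=1)
        # One teacher-forced pass conditioned on the CTC outputs: all U
        # positions are computed in parallel; the outputs are the HI that
        # the MT decoder attends to.
        return self.asr_decoder(self.embed(hyp), enc, tgt_mask=causal)

enc = torch.randn(2, 50, 256)                        # dummy encoder states
print(ParallelHIGenerator()(enc).shape)              # (2, U, 256)
```

During training, the paper's proposed remedy for the train/test mismatch would amount to sometimes feeding the decoder sampled CTC outputs instead of ground-truth transcripts; that sampling step is omitted above.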
Related papers
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture utilizes a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
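As an illustration of the activation-caching idea in the entry above, the sketch below keeps past keys and values so each incoming chunk attends to a bounded left context; the exact FastConformer caching scheme may differ, and the cache and chunk sizes are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of cache-based streaming attention (concept only): past
# keys/values are retained so each new chunk attends to a bounded left
# context without recomputing earlier frames.

class CachedSelfAttention(nn.Module):
    def __init__(self, d_model=256, nhead=4, max_cache=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.max_cache = max_cache
        self.cache = None                      # (B, <=max_cache, d_model)

    def forward(self, chunk):                  # chunk: (B, C, d_model)
        kv = chunk if self.cache is None else torch.cat([self.cache, chunk], 1)
        out, _ = self.attn(chunk, kv, kv)      # queries: the new chunk only
        self.cache = kv[:, -self.max_cache:].detach()   # slide the cache
        return out

layer = CachedSelfAttention()
stream = torch.randn(1, 100, 256)
outs = [layer(stream[:, t:t + 20]) for t in range(0, 100, 20)]  # 20-frame chunks
print(torch.cat(outs, 1).shape)
```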
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
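A loose sketch of the pseudo-code idea from the entry above, under the assumption that the discrete acoustic units come from a pre-computed k-means codebook over speech features; the codebook size and loss wiring are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Unpaired speech is quantized into discrete acoustic units ("pseudo codes",
# here nearest-centroid ids from an assumed k-means codebook), which then
# serve as decoder pre-training targets in place of text transcripts.

def pseudo_codes(feats, codebook):
    # feats: (T, d), codebook: (K, d) -> (T,) cluster ids
    return torch.cdist(feats, codebook).argmin(-1)

codebook = torch.randn(100, 80)                # stand-in k-means centroids
feats = torch.randn(200, 80)                   # unpaired speech features
codes = pseudo_codes(feats, codebook)          # discrete targets, no text needed

logits = torch.randn(200, 100, requires_grad=True)  # dummy decoder outputs
loss = nn.CrossEntropyLoss()(logits, codes)    # pre-training objective
loss.backward()
print(codes.shape, loss.item())
```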
- Streaming parallel transducer beam search with fast-slow cascaded encoders [23.416682253435837]
Streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.
We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders.
arXiv Detail & Related papers (2022-03-29T17:29:39Z)
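The cascade described above can be sketched as follows: a causal "fast" encoder emits frames with low latency, and a non-causal "slow" encoder re-encodes them over a wider context. The layer choices are placeholders, and the paper's parallel time-synchronous beam search over both streams is not shown.

```python
import torch
import torch.nn as nn

# Conceptual sketch of fast-slow cascaded encoders. A transducer decoder
# could decode from either stream: the fast output for streaming results,
# the slow output for refined non-streaming results.

d = 256
fast = nn.LSTM(d, d, batch_first=True)         # causal: left context only
slow_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
slow = nn.TransformerEncoder(slow_layer, num_layers=2)  # non-causal

x = torch.randn(1, 80, d)                      # incoming feature frames
fast_out, _ = fast(x)                          # available immediately
slow_out = slow(fast_out)                      # refreshed over full context
print(fast_out.shape, slow_out.shape)
```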
- Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring [83.32560748324667]
This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
arXiv Detail & Related papers (2021-09-09T16:50:16Z)
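A hedged sketch of Orthros's selection step, described in the entry above: the NAR decoder proposes one candidate per target length, and the auxiliary shallow AR decoder scores each candidate by teacher-forced log-likelihood over the shared encoder states. All module shapes and the scoring details are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative AR rescoring of NAR length candidates (not the paper's code).

def ar_score(ar_decoder, embed, cand, enc, bos=0):
    # Teacher-forced log-prob of candidate `cand` (1, U): position u is
    # predicted from BOS plus the tokens before u (standard AR shifting).
    inp = torch.cat([torch.full((1, 1), bos, dtype=torch.long), cand[:, :-1]], 1)
    mask = torch.triu(torch.full((inp.size(1),) * 2, float("-inf")), diagonal=1)
    h = ar_decoder(embed(inp), enc, tgt_mask=mask)          # (1, U, d)
    logp = torch.log_softmax(h @ embed.weight.T, -1)        # tied projection
    return logp.gather(-1, cand.unsqueeze(-1)).sum().item()

d, vocab = 256, 1000
embed = nn.Embedding(vocab, d)
layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
ar_decoder = nn.TransformerDecoder(layer, num_layers=1)    # "shallow" AR decoder

enc = torch.randn(1, 50, d)                                # shared encoder states
candidates = [torch.randint(1, vocab, (1, L)) for L in (8, 10, 12)]  # NAR outputs
best = max(candidates, key=lambda c: ar_score(ar_decoder, embed, c, enc))
print(best.shape)
```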
- Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders [30.160261563657947]
Speech-to-translation data is scarce, so pre-training is a promising approach for end-to-end speech translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins by processing the acoustic sequence as usual, but later behaves more like an MT encoder, producing a global representation of the input sequence.
arXiv Detail & Related papers (2021-05-12T16:09:53Z)
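A rough sketch of the stacked encoding above, assuming a small adapter bridges the two pre-trained components; SATE's actual bridging design may differ.

```python
import torch
import torch.nn as nn

# An acoustic encoder (as if pre-trained on ASR) feeds, through an assumed
# adapter, a textual encoder (as if pre-trained on MT) that builds a global
# representation of the utterance.

d = 256
acoustic = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
adapter = nn.Sequential(nn.Linear(d, d), nn.ReLU())   # bridges the two spaces
textual = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)

speech = torch.randn(1, 120, d)            # acoustic feature frames
h = acoustic(speech)                       # local/acoustic modeling
h = textual(adapter(h))                    # behaves like an MT encoder
print(h.shape)
```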
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline with only 0.0-0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
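In the Mask-CTC spirit shared by the entry above and Fast-MD's masked HI, the sketch below masks low-confidence CTC tokens and re-predicts them in one parallel pass of a conditional masked-LM decoder; the threshold, vocabulary, and module wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Tokens whose CTC confidence falls below a threshold are replaced by <mask>
# and re-predicted in parallel; no causal mask is applied, so the decoder
# sees the full (partially masked) hypothesis at once.

vocab, d, MASK = 1000, 256, 999
embed = nn.Embedding(vocab, d)
layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
cmlm = nn.TransformerDecoder(layer, num_layers=2)      # bidirectional over target
proj = nn.Linear(d, vocab)

enc = torch.randn(1, 50, d)                            # encoder states
ctc_tokens = torch.randint(0, vocab - 1, (1, 10))      # collapsed CTC hypothesis
ctc_conf = torch.rand(1, 10)                           # per-token CTC confidence

masked = ctc_tokens.masked_fill(ctc_conf < 0.5, MASK)  # hide unreliable tokens
logits = proj(cmlm(embed(masked), enc))                # one parallel pass
refined = torch.where(masked == MASK, logits.argmax(-1), ctc_tokens)
print(refined.shape)
```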
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used for selecting better translation among various length candidates generated from the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
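The time-restricted self-attention in the entry above reduces to a banded attention mask, as in this small sketch; the window sizes are illustrative, and triggered cross-attention is not shown.

```python
import torch
import torch.nn as nn

# Each frame may only attend to a fixed window of left/right context,
# bounding the look-ahead and hence the streaming latency.

def time_restricted_mask(T, left=8, right=2):
    # True entries are blocked; frame t sees keys in [t-left, t+right] only.
    idx = torch.arange(T)
    dist = idx[None, :] - idx[:, None]         # key index minus query index
    return (dist < -left) | (dist > right)

attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
x = torch.randn(1, 30, 256)
out, _ = attn(x, x, x, attn_mask=time_restricted_mask(30))
print(out.shape)
```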
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.