Non-autoregressive End-to-end Speech Translation with Parallel
Autoregressive Rescoring
- URL: http://arxiv.org/abs/2109.04411v1
- Date: Thu, 9 Sep 2021 16:50:16 GMT
- Title: Non-autoregressive End-to-end Speech Translation with Parallel
Autoregressive Rescoring
- Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji
Watanabe
- Abstract summary: This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models.
We propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder and an auxiliary shallow AR decoder on top of the shared encoder.
- Score: 83.32560748324667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article describes an efficient end-to-end speech translation (E2E-ST)
framework based on non-autoregressive (NAR) models. End-to-end speech
translation models have several advantages over traditional cascade systems
such as reduced inference latency. However, conventional autoregressive (AR)
decoding is not fast enough because each token is generated incrementally. NAR
models, in contrast, can accelerate decoding by generating multiple tokens in
parallel on the basis of a token-wise conditional independence assumption. We
propose a unified NAR E2E-ST framework called Orthros, which has an NAR decoder
and an auxiliary shallow AR decoder on top of the shared encoder. The auxiliary
shallow AR decoder selects the best hypothesis by rescoring multiple candidates
generated from the NAR decoder in parallel (parallel AR rescoring). We adopt
a conditional masked language model (CMLM) and a connectionist temporal
classification (CTC)-based model as NAR decoders for Orthros, referred to as
Orthros-CMLM and Orthros-CTC, respectively. We also propose two training
methods to enhance the CMLM decoder. Experimental evaluations on three
benchmark datasets with six language directions demonstrated that Orthros
achieved large improvements in translation quality with a very small overhead
compared with the baseline NAR model. Moreover, the Conformer encoder
architecture enabled large quality improvements, especially for CTC-based
models. Orthros-CTC with the Conformer encoder increased decoding speed by
3.63x on CPU with translation quality comparable to that of an AR model.
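As an illustration of parallel AR rescoring: because a causally masked decoder can compute the output distribution at every position from a teacher-forced (right-shifted) input, all NAR candidates can be scored in one batched forward pass instead of token-by-token loops. The sketch below is a hedged reconstruction under assumed interfaces (the decoder callable, special-token IDs, and length normalization are not taken from the paper):

```python
import torch
import torch.nn.functional as F

def rescore_candidates(ar_decoder, enc_out, candidates, bos_id=1, pad_id=0):
    """Score every NAR hypothesis with a single teacher-forced AR pass.

    ar_decoder: callable(prev_tokens, enc_out) -> logits of shape [N, T, V]
    enc_out:    shared encoder states of shape [1, S, D]
    candidates: LongTensor [N, T] of NAR hypotheses, padded with pad_id
    """
    n, _ = candidates.shape
    enc = enc_out.expand(n, -1, -1)                     # reuse encoder states
    bos = torch.full((n, 1), bos_id, dtype=torch.long)
    prev = torch.cat([bos, candidates[:, :-1]], dim=1)  # right-shift targets
    logp = F.log_softmax(ar_decoder(prev, enc), dim=-1)
    tok_logp = logp.gather(-1, candidates.unsqueeze(-1)).squeeze(-1)
    mask = (candidates != pad_id).float()
    scores = (tok_logp * mask).sum(1) / mask.sum(1)     # length-normalized
    return candidates[scores.argmax()], scores
```

Since the N candidates share one encoder pass and one batched decoder pass, the rescoring cost stays small, consistent with the "very small overhead" the abstract reports.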
Related papers
- CTC-based Non-autoregressive Textless Speech-to-Speech Translation [38.99922762754443]
Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding.
Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet their translation quality typically lags significantly behind that of autoregressive (AR) models.
In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation.
arXiv Detail & Related papers (2024-06-11T15:00:33Z)
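For reference, the decoding step that makes such CTC models non-autoregressive is a single parallel argmax over all frames followed by collapsing; a minimal sketch (the tensor layout and blank ID are assumptions):

```python
import torch

def ctc_best_path(log_probs, blank_id=0):
    """Greedy (best-path) CTC decoding: take the argmax of every frame in
    parallel, collapse consecutive repeats, then drop blanks."""
    best = log_probs.argmax(dim=-1)  # [T]; one parallel step, no AR loop
    out, prev = [], None
    for token in best.tolist():
        if token != prev and token != blank_id:
            out.append(token)
        prev = token
    return out
```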
- 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
arXiv Detail & Related papers (2024-06-05T05:18:20Z)
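In joint searches of this kind, hypotheses are typically ranked by a log-linear combination of the per-decoder scores; the sketch below illustrates the scoring rule only, and the weight names and values are placeholders rather than the paper's configuration:

```python
def joint_score(partial_scores, weights):
    """Log-linear combination of per-decoder log scores for one hypothesis."""
    return sum(weights[name] * partial_scores[name] for name in weights)

# Hypothetical weighting across three of the four decoders:
weights = {"ctc": 0.3, "attention": 0.4, "transducer": 0.3}
score = joint_score({"ctc": -4.2, "attention": -3.1, "transducer": -3.8},
                    weights)  # -3.64
```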
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
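The caching idea can be illustrated with a depthwise causal convolution that carries its left context between streaming chunks; this module is a hedged sketch of the general mechanism, not the FastConformer implementation:

```python
import torch

class CachedCausalConv1d(torch.nn.Module):
    """Depthwise causal conv whose left context is cached across chunks."""

    def __init__(self, channels, kernel_size):
        super().__init__()
        self.conv = torch.nn.Conv1d(channels, channels, kernel_size,
                                    groups=channels)   # no padding: causal
        self.ctx = kernel_size - 1                     # left context to keep

    def forward(self, chunk, cache=None):
        # chunk: [B, C, T_chunk]; cache: [B, C, ctx] past activations
        if cache is None:
            cache = chunk.new_zeros(chunk.size(0), chunk.size(1), self.ctx)
        x = torch.cat([cache, chunk], dim=-1)
        return self.conv(x), x[..., x.size(-1) - self.ctx:]

# Streaming usage: chunked outputs match full-utterance inference.
layer = CachedCausalConv1d(channels=8, kernel_size=3)
cache = None
for chunk in torch.randn(2, 8, 16).split(4, dim=-1):
    out, cache = layer(chunk, cache)   # out: [2, 8, 4] per 4-frame chunk
```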
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast multi-decoder (MD) model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
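Under heavy assumptions (all three module interfaces below are hypothetical stand-ins, not Fast-MD's actual components), the pipeline can be sketched as:

```python
import torch

def fast_md_style_decode(enc_out, ctc_head, asr_decoder, mt_decoder):
    """NAR hidden intermediates: CTC picks a transcript path in one parallel
    step, the ASR decoder maps it to hidden states (HI), and the MT decoder
    translates by attending over those states."""
    ctc_tokens = ctc_head(enc_out).argmax(dim=-1)    # no autoregressive loop
    hidden_intermediates = asr_decoder(ctc_tokens, enc_out)
    return mt_decoder(hidden_intermediates)          # translation logits
```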
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding than a strong AR baseline, with only 0.0-0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
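A single step of this refinement can be sketched as feeding CTC's greedy hypothesis to a non-causal decoder and re-predicting every position in parallel (the decoder signature is an assumption):

```python
import torch

def refine_ctc_hypothesis(nar_decoder, enc_out, ctc_tokens):
    """One NAR refinement pass: the decoder attends over the full CTC
    hypothesis (no causal mask) and outputs corrected tokens in parallel."""
    logits = nar_decoder(ctc_tokens, enc_out)  # [B, T, V] in a single pass
    return logits.argmax(dim=-1)
```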
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter selects the best translation among candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
- FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97x compared with a state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)
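The integrate-and-fire mechanism behind FastLR can be pictured as a thresholded accumulator over per-frame weights: each firing marks one emitted token, so all token boundaries are located without an autoregressive loop. A minimal sketch (the threshold and weight semantics are assumptions):

```python
def integrate_and_fire(alphas, threshold=1.0):
    """Accumulate frame weights; fire a boundary whenever the integrator
    crosses the threshold, carrying over the remainder."""
    acc, boundaries = 0.0, []
    for t, alpha in enumerate(alphas):
        acc += alpha
        if acc >= threshold:
            boundaries.append(t)
            acc -= threshold
    return boundaries

# integrate_and_fire([0.4, 0.5, 0.3, 0.9, 0.2]) -> [2, 3]
```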