Hybrid-Regressive Neural Machine Translation
- URL: http://arxiv.org/abs/2210.10416v1
- Date: Wed, 19 Oct 2022 09:26:15 GMT
- Title: Hybrid-Regressive Neural Machine Translation
- Authors: Qiang Wang, Xinhui Hu, Ming Chen
- Abstract summary: We investigate how to better combine the strengths of the autoregressive and non-autoregressive translation paradigms.
We propose a new two-stage translation prototype called hybrid-regressive translation (HRT).
HRT achieves the state-of-the-art BLEU score of 28.49 on the WMT En-De task and is at least 1.5x faster than AT, regardless of batch size and device.
- Score: 11.634586560239404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we empirically confirm that non-autoregressive translation with
an iterative refinement mechanism (IR-NAT) suffers from poor acceleration
robustness because it is more sensitive to decoding batch size and computing
device settings than autoregressive translation (AT). Motivated by this finding, we
investigate how to better combine the strengths of the autoregressive and
non-autoregressive translation paradigms. To this end, we demonstrate through
synthetic experiments that prompting with a small number of AT predictions can
bring one-shot non-autoregressive translation up to the performance of IR-NAT.
Following this line, we propose a new two-stage translation prototype called
hybrid-regressive translation (HRT). Specifically, HRT first generates a
discontinuous sequence autoregressively (e.g., making a prediction every k
tokens, k > 1) and then fills in all previously skipped tokens
at once in a non-autoregressive manner. We also propose a bag of techniques to
effectively and efficiently train HRT without adding any model parameters. HRT
achieves the state-of-the-art BLEU score of 28.49 on the WMT En-De task and is
at least 1.5x faster than AT, regardless of batch size and device. As an added
bonus, HRT also inherits the favorable characteristics of AT under the
deep-encoder-shallow-decoder architecture. Concretely, compared to
the vanilla HRT with a 6-layer encoder and 6-layer decoder, the inference speed
of HRT with a 12-layer encoder and 1-layer decoder is further doubled on both
GPU and CPU without BLEU loss.
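To make the two-stage decoding described in the abstract concrete, here is a minimal, illustrative Python sketch of HRT-style inference. It is not the authors' implementation: the callables at_step and nar_fill, the chunk size k, the mask placeholder, and the exact placement of skipped slots are all assumptions introduced for illustration; in the paper both stages share a single model without extra parameters.

```python
# Minimal, illustrative sketch of HRT-style two-stage decoding (not the
# authors' implementation). Assumptions: `at_step` autoregressively predicts
# the next retained token given the current skeleton, and `nar_fill` fills
# every masked position in a single forward pass.

MASK, EOS = "<mask>", "</s>"

def hrt_decode(src_tokens, at_step, nar_fill, k=2, max_len=128):
    """Stage 1: skip-autoregression; Stage 2: one-shot non-autoregressive infill."""
    # Stage 1: autoregressively emit only every k-th target token, producing a
    # discontinuous "skeleton" that is roughly 1/k of the full target length.
    skeleton = []
    while len(skeleton) * k < max_len:
        next_tok = at_step(src_tokens, skeleton)
        skeleton.append(next_tok)
        if next_tok == EOS:
            break

    # Stage 2: build a template with mask placeholders for the skipped slots
    # (their exact placement relative to the kept tokens is an assumption here)
    # and let the non-autoregressive pass predict all of them at once.
    template = []
    for tok in skeleton:
        template.append(tok)
        if tok != EOS:
            template.extend([MASK] * (k - 1))
    return nar_fill(src_tokens, template)
```

Because stage 1 needs only about 1/k of AT's sequential steps and stage 2 is a single parallel pass, the speedup does not hinge on large-batch parallelism, which is consistent with the reported speedup of at least 1.5x across batch sizes and devices.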
Related papers
- Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR [17.950722198543897]
We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition.
HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor.
arXiv Detail & Related papers (2024-10-03T15:38:20Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67x.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
- The RoyalFlush System for the WMT 2022 Efficiency Task [11.00644143928471]
This paper describes the submission of the RoyalFlush neural machine translation system for the WMT 2022 translation efficiency task.
Unlike the commonly used autoregressive translation system, we adopted a two-stage translation paradigm called Hybrid Regression Translation.
Our fastest system reaches 6k+ words/second in the GPU latency setting, estimated to be about 3.1x faster than last year's winner.
arXiv Detail & Related papers (2022-12-03T05:36:10Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Non-Autoregressive Neural Machine Translation: A Call for Clarity [3.1447111126465]
We take a step back and revisit several techniques that have been proposed for improving non-autoregressive translation models.
We provide novel insights for establishing strong baselines using length prediction or CTC-based architecture variants.
We contribute standardized BLEU, chrF++, and TER scores using sacreBLEU on four translation tasks.
arXiv Detail & Related papers (2022-05-21T12:15:22Z)
- Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision [33.04082398101807]
Existing neural machine translation models, such as Transformer, achieve high performance, but they decode words one by one, which is inefficient.
Recent non-autoregressive translation models speed up the inference, but their quality is still inferior.
We propose DSLP, a highly efficient and high-performance model for machine translation.
arXiv Detail & Related papers (2021-10-14T16:36:12Z)
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding speed than a strong AR baseline with only 0.0-0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
- Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation [68.25872110275542]
We propose an efficient inference procedure for non-autoregressive machine translation.
It iteratively refines translation purely in the continuous space.
We evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En.
arXiv Detail & Related papers (2020-09-15T15:30:14Z)
- Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translations with an 8x-15x speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
- Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation [78.51887060865273]
We show that a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed.
Our results establish a new protocol for future research toward fast, accurate machine translation.
arXiv Detail & Related papers (2020-06-18T09:06:49Z)
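On the deep-encoder-shallow-decoder point shared by this last entry and the HRT abstract above, the snippet below is a purely illustrative configuration sketch using PyTorch's generic nn.Transformer (neither paper's actual codebase): the layer budget is shifted from the decoder, which runs repeatedly during generation, to the encoder, which runs once per sentence.

```python
import torch.nn as nn

# Illustrative only: reallocating the layer budget as in the
# deep-encoder-shallow-decoder setup (12 encoder / 1 decoder layers),
# versus the conventional balanced 6/6 split. Sizes are placeholders.
balanced = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=6, num_decoder_layers=6)
deep_enc = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=12, num_decoder_layers=1)
# The encoder is evaluated once per sentence, while the decoder is evaluated
# at every generation step, so shrinking the decoder cuts inference cost most.
```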