The RoyalFlush System for the WMT 2022 Efficiency Task
- URL: http://arxiv.org/abs/2212.01543v1
- Date: Sat, 3 Dec 2022 05:36:10 GMT
- Title: The RoyalFlush System for the WMT 2022 Efficiency Task
- Authors: Bo Qin, Aixin Jia, Qiang Wang, Jianning Lu, Shuqin Pan, Haibo Wang,
Ming Chen
- Abstract summary: This paper describes the submission of the RoyalFlush neural machine translation system for the WMT 2022 translation efficiency task.
Unlike the commonly used autoregressive translation system, we adopted a two-stage translation paradigm called Hybrid Regression Translation.
Our fastest system reaches 6k+ words/second in the GPU latency setting, estimated to be about 3.1x faster than last year's winner.
- Score: 11.00644143928471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the submission of the RoyalFlush neural machine
translation system for the WMT 2022 translation efficiency task. Unlike the
commonly used autoregressive translation system, we adopted a two-stage
translation paradigm called Hybrid Regression Translation (HRT) to combine the
advantages of autoregressive and non-autoregressive translation. Specifically,
HRT first autoregressively generates a discontinuous sequence (e.g., make a
prediction every $k$ tokens, $k>1$) and then fills in all previously skipped
tokens at once in a non-autoregressive manner. Thus, we can easily trade off
the translation quality and speed by adjusting $k$. In addition, by integrating
other modeling techniques (e.g., sequence-level knowledge distillation and
deep-encoder-shallow-decoder layer allocation strategy) and substantial
engineering effort, HRT improves inference speed by 80\% and achieves
translation performance equivalent to that of a same-capacity AT counterpart.
Our fastest system reaches 6k+ words/second in the GPU latency setting,
estimated to be about 3.1x faster than last year's winner.
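To make the two-stage paradigm concrete, the sketch below decodes a skeleton autoregressively and then fills the gaps in one parallel pass; `ar_next_token` and `nar_fill_gaps` are hypothetical stand-ins for the autoregressive and non-autoregressive passes of a real HRT model, not the authors' implementation.

```python
# Illustrative two-stage HRT decoding; `ar_next_token` and `nar_fill_gaps`
# are hypothetical stand-ins for the two passes of a real model.
def hrt_decode(src_tokens, ar_next_token, nar_fill_gaps, k=2, max_len=128, eos="</s>"):
    # Stage 1: autoregressively generate a discontinuous "skeleton",
    # i.e. one prediction for every k-th target position.
    skeleton = []
    while len(skeleton) * k < max_len:
        tok = ar_next_token(src_tokens, skeleton)  # next skeleton token
        skeleton.append(tok)
        if tok == eos:
            break
    # Stage 2: fill all previously skipped positions at once,
    # non-autoregressively, conditioned on the source and the skeleton.
    return nar_fill_gaps(src_tokens, skeleton, k)
```

Raising $k$ shortens the sequential stage but leaves more positions to the single parallel pass, which is the speed/quality trade-off the abstract refers to.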
Related papers
- Hybrid-Regressive Neural Machine Translation [11.634586560239404]
We investigate how to better combine the strengths of the autoregressive and non-autoregressive translation paradigms.
We propose a new two-stage translation prototype called hybrid-regressive translation (HRT).
HRT achieves the state-of-the-art BLEU score of 28.49 on the WMT En-De task and is at least 1.5x faster than AT, regardless of batch size and device.
arXiv Detail & Related papers (2022-10-19T09:26:15Z)
- Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision [33.04082398101807]
Existing neural machine translation models, such as Transformer, achieve high performance, but they decode words one by one, which is inefficient.
Recent non-autoregressive translation models speed up inference, but their quality is still inferior.
We propose DSLP, a highly efficient and high-performance model for machine translation.
arXiv Detail & Related papers (2021-10-14T16:36:12Z)
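As a rough illustration of the layer-wise prediction and deep supervision named in the DSLP title above, the sketch below gives every decoder layer its own prediction and cross-entropy loss and feeds that prediction into the next layer; the module name, dimensions, and exact feeding scheme are assumptions, not the paper's code.

```python
# Hedged sketch of layer-wise prediction with deep supervision in a NAT decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwiseNATDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.out_proj = nn.Linear(d_model, vocab_size)    # shared prediction head
        self.pred_embed = nn.Embedding(vocab_size, d_model)

    def forward(self, x, encoder_out, target=None):
        total_loss, logits = 0.0, None
        for layer in self.layers:
            x = layer(x, encoder_out)                     # no causal mask: parallel decoding
            logits = self.out_proj(x)                     # layer-wise prediction
            if target is not None:                        # deep supervision at every layer
                total_loss = total_loss + F.cross_entropy(logits.transpose(1, 2), target)
            x = x + self.pred_embed(logits.argmax(-1))    # feed the prediction to the next layer
        return logits, total_loss
```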
- The Volctrans GLAT System: Non-autoregressive Translation Meets WMT21 [25.41660831320743]
We build a parallel (i.e., non-autoregressive) translation system using the Glancing Transformer.
Our system achieves the best BLEU score (35.0) on the German->English translation task, outperforming all strong autoregressive counterparts.
arXiv Detail & Related papers (2021-09-23T09:41:44Z)
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
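Of the strategies listed for the submission above, back translation is the most mechanical; a minimal sketch is below, assuming a hypothetical `reverse_model.translate` interface rather than the team's actual tooling.

```python
# Illustrative back-translation loop: create synthetic (source, target) pairs
# from target-side monolingual text using a reverse-direction model.
def back_translate(monolingual_target, reverse_model):
    synthetic_pairs = []
    for tgt_sentence in monolingual_target:
        # The (possibly noisy) reverse translation becomes the synthetic source side.
        synthetic_src = reverse_model.translate(tgt_sentence)
        synthetic_pairs.append((synthetic_src, tgt_sentence))
    return synthetic_pairs

# Usage sketch: mix synthetic pairs with the genuine parallel data.
# train_data = parallel_pairs + back_translate(mono_tgt, reverse_model)
```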
- Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models.
We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
arXiv Detail & Related papers (2021-04-13T19:00:51Z)
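Sequence-level knowledge distillation, used both in the entry above and in the RoyalFlush system, amounts to re-labelling the training sources with a teacher's beam-search outputs; the sketch below shows that pipeline with a hypothetical `teacher.translate` interface.

```python
# Minimal sequence-level knowledge distillation (SeqKD) corpus builder:
# the teacher's beam-search output replaces the human reference, and the
# student is then trained on the distilled pairs like ordinary parallel data.
def build_seqkd_corpus(parallel_data, teacher, beam_size=5):
    distilled = []
    for src, _ref in parallel_data:
        teacher_hyp = teacher.translate(src, beam=beam_size)
        distilled.append((src, teacher_hyp))
    return distilled

# e.g. student is trained on build_seqkd_corpus(train_set, teacher_model)
```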
- Future-Guided Incremental Transformer for Simultaneous Translation [6.8452940299620435]
Simultaneous translation (ST) begins translating while still reading the source sentence and is used in many online scenarios.
The wait-k policy has two weaknesses: slow training caused by recomputing hidden states, and a lack of future source information to guide training.
We propose an incremental Transformer with an average embedding layer (AEL) to accelerate the calculation of hidden states.
arXiv Detail & Related papers (2020-12-23T03:04:49Z)
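For context on the entry above, the standard wait-k read/write policy it builds on can be sketched as follows; `predict_next` is a hypothetical one-step decoder, and the loop is illustrative rather than the paper's incremental Transformer.

```python
# Illustrative wait-k schedule: READ k source tokens first, then alternate
# WRITE/READ so the target lags the source by roughly k tokens.
def wait_k_decode(source_stream, predict_next, k=3, eos="</s>", max_extra=50):
    read, written = [], []
    for src_token in source_stream:
        read.append(src_token)                        # READ one source token
        if len(read) < k:
            continue                                  # wait for the first k tokens
        written.append(predict_next(read, written))   # WRITE one target token
        if written[-1] == eos:
            return written
    # Source exhausted: finish the tail with ordinary autoregressive decoding.
    while (not written or written[-1] != eos) and len(written) < len(read) + max_extra:
        written.append(predict_next(read, written))
    return written
```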
- Incorporating a Local Translation Mechanism into Non-autoregressive Translation [28.678752678905244]
We introduce a novel local autoregressive translation mechanism into non-autoregressive translation (NAT) models.
For each target decoding position, instead of only one token, we predict a short sequence of tokens in an autoregressive way.
We design an efficient merging algorithm to align and merge the output pieces into one final output sequence.
arXiv Detail & Related papers (2020-11-12T00:32:51Z)
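A toy version of the piece-wise generation and merging idea in the entry above: each target position emits a short autoregressive piece, and overlapping pieces are merged by position. This first-prediction-wins merge is a deliberate simplification, not the paper's alignment algorithm.

```python
# Toy merge of overlapping locally-autoregressive pieces.
def merge_pieces(pieces, m):
    """pieces[t] is the length-m token list predicted for positions t..t+m-1."""
    merged = []
    for t, piece in enumerate(pieces):
        for offset, token in enumerate(piece):
            pos = t + offset
            if pos == len(merged):
                merged.append(token)   # first prediction for this position wins
            # later pieces that disagree are simply ignored in this toy version
    return merged

# merge_pieces([["we", "propose"], ["propose", "a"], ["a", "method"]], m=2)
# -> ["we", "propose", "a", "method"]
```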
- Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation [68.25872110275542]
We propose an efficient inference procedure for non-autoregressive machine translation.
It iteratively refines the translation purely in the continuous space.
We evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En.
arXiv Detail & Related papers (2020-09-15T15:30:14Z)
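The continuous-space refinement loop in the entry above can be pictured as repeatedly updating a dense target representation and discretising only once at the end; the module below is a generic stand-in under that assumption, not the paper's model.

```python
# Generic continuous-space refinement loop (illustrative stand-in).
import torch
import torch.nn as nn

class ContinuousRefiner(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, z, encoder_out, n_steps=4):
        # z: (batch, tgt_len, d_model) initial continuous guess
        # encoder_out: (batch, tgt_len, d_model) source context, pre-aligned here
        for _ in range(n_steps):
            z = z + self.step(torch.cat([z, encoder_out], dim=-1))  # refine in continuous space
        return self.to_vocab(z).argmax(-1)   # discretise only once, at the end
```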
- Glancing Transformer for Non-Autoregressive Neural Machine Translation [58.87258329683682]
We propose a method to learn word interdependency for single-pass parallel generation models.
With only single-pass parallel decoding, GLAT is able to generate high-quality translations with an 8-15x speedup.
arXiv Detail & Related papers (2020-08-18T13:04:03Z)
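A hedged sketch of glancing training as summarised in the GLAT entry above: a first parallel pass measures how far the prediction is from the reference, a proportional number of reference tokens is revealed to the decoder, and the second pass is supervised on the remaining positions. The function signature and sampling ratio are assumptions, not the paper's code.

```python
# Glancing-style training step for a parallel decoder (illustrative).
import torch
import torch.nn.functional as F

def glancing_loss(decoder, embed, dec_input, encoder_out, target, ratio=0.5):
    with torch.no_grad():
        first_pred = decoder(dec_input, encoder_out).argmax(-1)   # pass 1 (parallel)
        n_wrong = (first_pred != target).sum(-1)                  # per-sentence distance
        n_glance = (n_wrong.float() * ratio).long()               # how many tokens to reveal

    glance_mask = torch.zeros_like(target, dtype=torch.bool)
    for i, n in enumerate(n_glance.tolist()):                     # sample positions to reveal
        if n > 0:
            glance_mask[i, torch.randperm(target.size(1))[:n]] = True

    # Replace decoder inputs with reference embeddings at glanced positions.
    mixed_input = torch.where(glance_mask.unsqueeze(-1), embed(target), dec_input)
    second_logits = decoder(mixed_input, encoder_out)             # pass 2 (parallel)
    return F.cross_entropy(second_logits[~glance_mask], target[~glance_mask])
```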
- Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation [78.51887060865273]
We show that a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed.
Our results establish a new protocol for future research toward fast, accurate machine translation.
arXiv Detail & Related papers (2020-06-18T09:06:49Z)
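The deep-encoder/shallow-decoder layer allocation discussed above (and reused by the RoyalFlush system) can be illustrated with a stock PyTorch Transformer; the specific layer counts below are illustrative, not the paper's exact configuration.

```python
# Spend the layer budget on the encoder (runs once per sentence) and keep a
# single autoregressive decoder layer (whose cost is paid at every step).
import torch.nn as nn

deep_enc_shallow_dec = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=12,   # deep encoder
    num_decoder_layers=1,    # shallow decoder
    batch_first=True,
)
```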
- Non-Autoregressive Machine Translation with Disentangled Context Transformer [70.95181466892795]
State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens.
We propose an attention-masking based model, called Disentangled Context (DisCo) transformer, that simultaneously generates all tokens given different contexts.
Our model achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.
arXiv Detail & Related papers (2020-01-15T05:32:18Z)
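The attention-masking idea behind the DisCo entry above can be reduced to giving every output position its own context over the other target tokens, expressed as a per-position mask; the sketch below builds such a mask and illustrates the scheme only, not the paper's implementation.

```python
# Per-position context masks for generating all tokens simultaneously
# from different contexts.
import torch

def disco_context_mask(tgt_len, keep_prob=0.5):
    """Return a (tgt_len, tgt_len) boolean mask; entry (i, j) is True when
    position j is visible as context while predicting position i."""
    mask = torch.rand(tgt_len, tgt_len) < keep_prob   # sample a context per row
    mask.fill_diagonal_(False)                        # a token never sees itself
    return mask

# Row i is then used as the attention mask when predicting token i, so all
# positions are generated in parallel, each from its own context.
```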