Diformer: Directional Transformer for Neural Machine Translation
- URL: http://arxiv.org/abs/2112.11632v1
- Date: Wed, 22 Dec 2021 02:35:29 GMT
- Title: Diformer: Directional Transformer for Neural Machine Translation
- Authors: Minghan Wang, Jiaxin Guo, Yuxia Wang, Daimeng Wei, Hengchao Shang,
Chang Su, Yimeng Chen, Yinglu Li, Min Zhang, Shimin Tao, Hao Yang
- Abstract summary: Autoregressive (AR) and Non-autoregressive (NAR) models have their own advantages in performance and latency.
We propose the Directional Transformer (Diformer) by jointly modelling AR and NAR into three generation directions.
Experiments on 4 WMT benchmarks demonstrate that Diformer outperforms current unified-modelling works by more than 1.5 BLEU points for both AR and NAR decoding.
- Score: 13.867255817435705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive (AR) and Non-autoregressive (NAR) models have their own
advantages in performance and latency; combining them into one model may
take advantage of both. Current combination frameworks focus more on the
integration of multiple decoding paradigms with a unified generative model,
e.g. the Masked Language Model. However, this generalization can harm
performance due to the gap between the training objective and inference. In this
paper, we aim to close the gap by preserving the original objective of AR and
NAR under a unified framework. Specifically, we propose the Directional
Transformer (Diformer) by jointly modelling AR and NAR into three generation
directions (left-to-right, right-to-left and straight) with a newly introduced
direction variable, which works by controlling the prediction of each token to
have specific dependencies under that direction. The unification achieved by
direction successfully preserves the original dependency assumption used in AR
and NAR, retaining both generalization and performance. Experiments on 4 WMT
benchmarks demonstrate that Diformer outperforms current unified-modelling works
by more than 1.5 BLEU points for both AR and NAR decoding, and is also
competitive with the state-of-the-art independent AR and NAR models.
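The direction mechanism described in the abstract can be pictured as a per-direction self-attention mask in the decoder: under left-to-right generation a token may only depend on its left context, under right-to-left generation only on its right context, and under the straight (NAR) direction on the full target. The snippet below is a minimal illustrative sketch of that idea, not the authors' implementation; the function name directional_mask, the use of PyTorch, and the exact mask conventions are assumptions made here for clarity.

```python
# Illustrative sketch (NOT the Diformer authors' code) of how a direction
# variable could select the decoder self-attention mask that enforces each
# dependency assumption described in the abstract:
#   "l2r"      - left-to-right AR: token i may attend to positions <= i
#   "r2l"      - right-to-left AR: token i may attend to positions >= i
#   "straight" - NAR-style: every token may attend to all positions
import torch


def directional_mask(direction: str, length: int) -> torch.Tensor:
    """Return a boolean [length, length] mask where True means attention is allowed."""
    positions = torch.arange(length)
    if direction == "l2r":        # causal (lower-triangular) mask
        return positions.unsqueeze(0) <= positions.unsqueeze(1)
    if direction == "r2l":        # anti-causal (upper-triangular) mask
        return positions.unsqueeze(0) >= positions.unsqueeze(1)
    if direction == "straight":   # fully visible, as in NAR decoding
        return torch.ones(length, length, dtype=torch.bool)
    raise ValueError(f"unknown direction: {direction}")


if __name__ == "__main__":
    # Example: the three masks for a 4-token target sequence.
    for d in ("l2r", "r2l", "straight"):
        print(d, directional_mask(d, 4), sep="\n")
```

In this reading, a single set of decoder parameters is trained under all three masks, while the direction variable decides which dependency pattern applies to each prediction; how the paper actually conditions on the direction variable is specified in the full text.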
Related papers
- Leveraging Diverse Modeling Contexts with Collaborating Learning for
Neural Machine Translation [26.823126615724888]
Autoregressive (AR) and Non-autoregressive (NAR) models are two types of generative models for Neural Machine Translation (NMT)
We propose a novel generic collaborative learning method, DCMCL, where AR and NAR models are treated as collaborators instead of teachers and students.
arXiv Detail & Related papers (2024-02-28T15:55:02Z)
- Distilling Autoregressive Models to Obtain High-Performance Non-Autoregressive
Solvers for Vehicle Routing Problems with Faster Inference Speed [8.184624214651283]
We propose a generic Guided Non-Autoregressive Knowledge Distillation (GNARKD) method to obtain high-performance NAR models having a low inference latency.
We evaluate GNARKD by applying it to three widely adopted AR models to obtain NAR VRP solvers for both synthesized and real-world instances.
arXiv Detail & Related papers (2023-12-19T07:13:32Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive
End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- On the Role of Bidirectionality in Language Model Pre-Training [85.14614350372004]
We study the role of bidirectionality in next token prediction, text infilling, zero-shot priming and fine-tuning.
We train models with up to 6.7B parameters, and find differences to remain consistent at scale.
arXiv Detail & Related papers (2022-05-24T02:25:05Z)
- A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text
Generation [59.64193903397301]
Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive baselines.
We conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR)
The results on various tasks provide interesting findings for developing an understanding of NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances.
arXiv Detail & Related papers (2021-10-11T13:05:06Z)
- TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech
Recognition [69.68154370877615]
Non-autoregressive (NAR) models remove the temporal dependency between output tokens and can predict the entire output sequence in as few as one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- An EM Approach to Non-autoregressive Conditional Sequence Generation [49.11858479436565]
Autoregressive (AR) models have been the dominating approach to conditional sequence generation.
Non-autoregressive (NAR) models have been recently proposed to reduce the latency by generating all output tokens in parallel.
This paper proposes a new approach that jointly optimizes both AR and NAR models in a unified Expectation-Maximization framework.
arXiv Detail & Related papers (2020-06-29T20:58:57Z)
- A Study of Non-autoregressive Model for Sequence Generation [147.89525760170923]
Non-autoregressive (NAR) models generate all the tokens of a sequence in parallel.
We propose knowledge distillation and source-target alignment to bridge the gap between AR and NAR models.
arXiv Detail & Related papers (2020-04-22T09:16:09Z)