Optimizing Non-Autoregressive Transformers with Contrastive Learning
- URL: http://arxiv.org/abs/2305.13667v2
- Date: Fri, 2 Jun 2023 10:48:41 GMT
- Title: Optimizing Non-Autoregressive Transformers with Contrastive Learning
- Authors: Chenxin An, Jiangtao Feng, Fei Huang, Xipeng Qiu, Lingpeng Kong
- Abstract summary: Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
- Score: 74.46714706658517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive Transformers (NATs) reduce the inference latency of
Autoregressive Transformers (ATs) by predicting words all at once rather than
in sequential order. They have achieved remarkable progress in machine
translation as well as many other applications. However, a long-standing
challenge for NATs is learning the multi-modal data distribution, which
is the main cause of the performance gap between NATs and ATs. In this paper,
we propose to ease the difficulty of modality learning via sampling from the
model distribution instead of the data distribution. We derive contrastive
constraints to stabilize the training process and integrate the resulting
objective with the state-of-the-art NAT architecture DA-Transformer. Our model
is evaluated on three tasks, machine translation, text summarization, and
paraphrasing, across five benchmarks. Results show that our
approach outperforms previous non-autoregressive baselines by a significant
margin and establishes new state-of-the-art results for non-autoregressive
transformers on all the benchmarks.
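As a rough illustration of the abstract's core idea, the sketch below trains on sequences sampled from the model itself rather than from the data distribution, and applies a pairwise contrastive (ranking) constraint over those samples. This is a minimal, hypothetical sketch: the quality score, margin, and loss form are assumptions made for illustration, not the paper's exact objective or its DA-Transformer integration.

```python
import torch
import torch.nn.functional as F

def contrastive_sample_loss(logps, quality, margin=0.01):
    """Pairwise margin loss over model samples ranked by an external quality score.

    logps:   (num_samples,) sequence log-probabilities under the model
    quality: (num_samples,) quality scores for the same samples (e.g. BLEU)

    Hypothetical sketch: it pushes the model to assign higher likelihood to
    better-scoring sequences drawn from its own distribution, instead of
    fitting the multi-modal data distribution directly. It is NOT the paper's
    actual objective, only an illustration of the general idea.
    """
    order = torch.argsort(quality, descending=True)   # best sample first
    logps = logps[order]
    loss = logps.new_zeros(())
    num_pairs = 0
    for i in range(len(logps)):
        for j in range(i + 1, len(logps)):
            # sample i is ranked above sample j: enforce a rank-scaled margin
            loss = loss + F.relu(margin * (j - i) - (logps[i] - logps[j]))
            num_pairs += 1
    return loss / max(num_pairs, 1)
```

In a usage like this, `logps` would come from scoring several decoded candidates under the NAT model and `quality` from comparing those candidates to the reference; both names are placeholders chosen for the example.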
Related papers
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - Fuzzy Alignments in Directed Acyclic Graph for Non-Autoregressive Machine Translation [18.205288788056787]
Non-autoregressive translation (NAT) reduces the decoding latency but suffers from performance degradation due to the multi-modality problem.
In this paper, we hold the view that all paths in the graph are fuzzily aligned with the reference sentence.
We do not require an exact alignment but train the model to maximize a fuzzy alignment score between the graph and the reference, which takes translations captured in all modalities into account.
arXiv Detail & Related papers (2023-03-12T13:51:38Z) - Rephrasing the Reference for Non-Autoregressive Machine Translation [37.816198073720614]
Non-autoregressive neural machine translation (NAT) models suffer from the multi-modality problem: a source sentence may have multiple possible translations.
We introduce a rephraser to provide a better training target for NAT by rephrasing the reference sentence according to the NAT output.
Our best variant achieves comparable performance to the autoregressive Transformer, while being 14.7 times more efficient in inference.
arXiv Detail & Related papers (2022-11-30T10:05:03Z) - Directed Acyclic Transformer for Non-Autoregressive Machine Translation [93.31114105366461]
Directed Acyclic Transformer (DA-Transformer) represents hidden states in a Directed Acyclic Graph (DAG).
DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average.
arXiv Detail & Related papers (2022-05-16T06:02:29Z) - Sequence-Level Training for Non-Autoregressive Neural Machine Translation [33.17341980163439]
Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive mechanism and achieves significant decoding speedup.
We propose using sequence-level training objectives to train NAT models, which evaluate the NAT outputs as a whole and correlate well with the real translation quality.
arXiv Detail & Related papers (2021-06-15T13:30:09Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z) - Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation [54.864148836486166]
We propose to incorporate the explicit syntactic and semantic structures of languages into a non-autoregressive Transformer.
Our model achieves significantly faster decoding while maintaining translation quality compared with several state-of-the-art non-autoregressive models.
arXiv Detail & Related papers (2021-01-22T04:12:17Z) - Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade [47.97977478431973]
Fully non-autoregressive neural machine translation (NAT) predicts all tokens simultaneously with a single forward pass of the neural network.
In this work, we target closing the performance gap while maintaining the latency advantage.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
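For context on the latency claims in the entries above, here is a minimal, hypothetical sketch of fully non-autoregressive decoding: one forward pass produces logits for every target position, and all tokens are chosen in parallel. The `model` signature (returning `(batch, tgt_len, vocab)` logits for a given target length) is an assumption made for illustration, not any specific paper's interface.

```python
import torch

@torch.no_grad()
def parallel_decode(model, src_tokens, max_len):
    """One-shot non-autoregressive decoding.

    A single forward pass yields logits for every target position, and all
    output tokens are selected in parallel via argmax. `model` is a
    hypothetical NAT returning (batch, max_len, vocab) logits; this is a
    generic illustration, not a specific paper's decoder.
    """
    logits = model(src_tokens, tgt_len=max_len)   # one call for all positions
    return logits.argmax(dim=-1)                  # pick every token at once
```

An autoregressive decoder would instead call the model once per output token; the single call above is what produces the speedups reported by the non-autoregressive papers listed here.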
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.