Integrated Training for Sequence-to-Sequence Models Using
Non-Autoregressive Transformer
- URL: http://arxiv.org/abs/2109.12950v1
- Date: Mon, 27 Sep 2021 11:04:09 GMT
- Title: Integrated Training for Sequence-to-Sequence Models Using
Non-Autoregressive Transformer
- Authors: Evgeniia Tokarchuk, Jan Rosendahl, Weiyue Wang, Pavel Petrushkov,
Tomer Lancewicki, Shahram Khadivi, Hermann Ney
- Abstract summary: We propose a cascaded model based on the non-autoregressive Transformer that enables end-to-end training without the need for an explicit intermediate representation.
We conduct an evaluation on two pivot-based machine translation tasks, namely French-German and German-Czech.
- Score: 49.897891031932545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex natural language applications such as speech translation or pivot
translation traditionally rely on cascaded models. However, cascaded models are
known to be prone to error propagation and model discrepancy problems.
Furthermore, conventional cascaded systems cannot make use of end-to-end
training data, so the training data best suited for the task goes unused.
Previous studies have suggested several approaches for integrated end-to-end
training to overcome these problems; however, they mostly rely on (synthetic or
natural) three-way data. We propose a cascaded model
based on the non-autoregressive Transformer that enables end-to-end training
without the need for an explicit intermediate representation. This new
architecture (i) avoids unnecessary early decisions that can cause errors which
are then propagated throughout the cascaded models and (ii) utilizes the
end-to-end training data directly. We conduct an evaluation on two pivot-based
machine translation tasks, namely French-German and German-Czech. Our
experimental results show that the proposed architecture yields an improvement
of more than 2 BLEU for French-German over the cascaded baseline.
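To make the architecture concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of a two-stage cascade in which the first stage decodes the pivot non-autoregressively and passes its continuous decoder states, rather than discrete pivot tokens, to the second stage, so that plain (source, target) pairs can drive end-to-end training. Class names, layer sizes, and the fixed pivot length are illustrative assumptions; positional encodings and padding masks are omitted for brevity.

```python
# Sketch only: a cascaded source->pivot->target model where the pivot stage is
# non-autoregressive, so no hard pivot decision interrupts the gradient path.
# All names, sizes and the fixed pivot length are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonAutoregressivePivotStage(nn.Module):
    """Source encoder plus a NAT decoder that predicts all pivot positions in parallel."""

    def __init__(self, src_vocab, pivot_vocab, d_model=256, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        # Decoder queries are position embeddings, not previously emitted tokens.
        self.pos = nn.Embedding(1024, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.pivot_proj = nn.Linear(d_model, pivot_vocab)  # optional auxiliary pivot loss

    def forward(self, src_tokens, pivot_len):
        memory = self.encoder(self.embed(src_tokens))
        queries = self.pos(torch.arange(pivot_len, device=src_tokens.device))
        queries = queries.unsqueeze(0).repeat(src_tokens.size(0), 1, 1)
        hidden = self.decoder(queries, memory)  # continuous and differentiable
        return hidden, self.pivot_proj(hidden)


class CascadedNAT(nn.Module):
    """The second stage reads the first stage's hidden states instead of pivot tokens."""

    def __init__(self, src_vocab, pivot_vocab, tgt_vocab, d_model=256, nhead=4, layers=2):
        super().__init__()
        self.stage1 = NonAutoregressivePivotStage(src_vocab, pivot_vocab, d_model, nhead, layers)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.stage2 = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.out_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_tokens, tgt_prefix, pivot_len):
        pivot_hidden, pivot_logits = self.stage1(src_tokens, pivot_len)
        t = tgt_prefix.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=src_tokens.device), diagonal=1)
        tgt_hidden = self.stage2(self.tgt_embed(tgt_prefix), pivot_hidden, tgt_mask=causal)
        return self.out_proj(tgt_hidden), pivot_logits


# Toy end-to-end step on a (source, target) pair; no pivot reference is needed,
# though pivot_logits could take an auxiliary loss when three-way data exists.
model = CascadedNAT(src_vocab=8000, pivot_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 11))
tgt = torch.randint(0, 8000, (2, 13))
logits, _ = model(src, tgt[:, :-1], pivot_len=12)
loss = F.cross_entropy(logits.reshape(-1, 8000), tgt[:, 1:].reshape(-1))
loss.backward()  # gradients flow through the NAT pivot stage to the source encoder
```

The key design choice the sketch tries to illustrate is that the pivot stage emits all positions in parallel with no argmax or sampling step, so the cascade behaves like a single differentiable network during training while still factoring through a pivot-language stage.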
Related papers
- Bi-Drop: Enhancing Fine-tuning Generalization via Synchronous sub-net
Estimation and Optimization [58.90989478049686]
Bi-Drop is a fine-tuning strategy that selectively updates model parameters using gradients from various sub-nets.
Experiments on the GLUE benchmark demonstrate that Bi-Drop consistently outperforms previous fine-tuning methods.
arXiv Detail & Related papers (2023-05-24T06:09:26Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models [46.24479693469042]
This paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not.
arXiv Detail & Related papers (2022-10-25T17:45:36Z)
- Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
- End-to-End Weak Supervision [15.125993628007972]
We propose an end-to-end approach for directly learning the downstream model.
We show improved performance over prior work in terms of end model performance on downstream test sets.
arXiv Detail & Related papers (2021-07-05T19:10:11Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Unsupervised Pretraining for Neural Machine Translation Using Elastic Weight Consolidation [0.0]
This work presents our ongoing research on unsupervised pretraining in neural machine translation (NMT).
In our method, we initialize the weights of the encoder and decoder with two language models that are trained on monolingual data; a minimal sketch of this initialization step follows the list.
We show that initializing the bidirectional NMT encoder with a left-to-right language model and forcing the model to remember the original left-to-right language modeling task limits the learning capacity of the encoder.
arXiv Detail & Related papers (2020-10-19T11:51:45Z)
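As promised above, here is a minimal, hypothetical sketch of the initialization step from the last entry: copying weights from two separately pretrained monolingual language models into an NMT model's encoder and decoder before translation training. The checkpoint paths, the assumption of matching parameter names, and the `encoder`/`decoder` attribute names are illustrative; the Elastic Weight Consolidation term, which would penalize drift from these copied weights during NMT training, is not shown.

```python
# Hypothetical illustration of LM-based initialization for NMT, as described in
# the "Unsupervised Pretraining ... Elastic Weight Consolidation" entry above.
# Assumes the LM checkpoints and the NMT encoder/decoder share parameter names.
import torch


def init_nmt_from_language_models(nmt_model, src_lm_ckpt, tgt_lm_ckpt):
    """Copy matching weights from two monolingual LM checkpoints into an NMT model."""
    src_state = torch.load(src_lm_ckpt, map_location="cpu")
    tgt_state = torch.load(tgt_lm_ckpt, map_location="cpu")
    # strict=False leaves NMT-only parameters (e.g. cross-attention) at their
    # random initialization and overwrites everything the LMs also contain.
    nmt_model.encoder.load_state_dict(src_state, strict=False)
    nmt_model.decoder.load_state_dict(tgt_state, strict=False)
    return nmt_model
```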
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.