BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural
Machine Translation
- URL: http://arxiv.org/abs/2109.04588v1
- Date: Thu, 9 Sep 2021 23:43:41 GMT
- Title: BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural
Machine Translation
- Authors: Haoran Xu, Benjamin Van Durme, Kenton Murray
- Abstract summary: We show that a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) achieves state-of-the-art translation performance.
Our best models achieve BLEU scores of 30.45 for En->De and 38.61 for De->En on the IWSLT'14 dataset, and 31.26 for En->De and 34.94 for De->En on the WMT'14 dataset.
- Score: 38.017030073108735
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of bidirectional encoders using masked language models, such as
BERT, on numerous natural language processing tasks has prompted researchers to
attempt to incorporate these pre-trained models into neural machine translation
(NMT) systems. However, the proposed methods for incorporating pre-trained models
are non-trivial and focus mainly on BERT, leaving the impact that other
pre-trained models may have on translation performance largely unexamined. In
this paper, we demonstrate that simply using the output (contextualized
embeddings) of a tailored and suitable bilingual pre-trained language model
(dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art
translation performance. Moreover, we propose a stochastic layer selection
approach and a dual-directional translation model to ensure sufficient
utilization of the contextualized embeddings. Without using back translation,
our best models achieve BLEU scores of 30.45 for En->De and 38.61 for De->En on
the IWSLT'14 dataset, and 31.26 for En->De and 34.94 for De->En on the WMT'14
dataset, exceeding all previously published numbers.
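The recipe described in the abstract is simple enough to sketch: freeze a bilingual BERT-style model, feed its hidden states to a standard Transformer NMT encoder in place of ordinary token embeddings, and, during training, randomly choose which hidden layer is fed so that all layers are utilized. Below is a minimal sketch under those assumptions; the checkpoint name, the linear projection, and the per-batch layer sampling are illustrative choices rather than the exact BiBERT configuration.

```python
# Minimal sketch: contextualized embeddings from a (bilingual) BERT-style model
# as inputs to an NMT encoder, with a simple stochastic layer selection.
# Checkpoint name and sampling scheme are assumptions, not the authors' exact setup.
import random
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ContextualizedEncoderInput(nn.Module):
    def __init__(self, pretrained_name="bert-base-multilingual-cased", d_model=512):
        super().__init__()
        self.lm = AutoModel.from_pretrained(pretrained_name, output_hidden_states=True)
        self.lm.requires_grad_(False)                      # keep the pre-trained LM frozen
        self.proj = nn.Linear(self.lm.config.hidden_size, d_model)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.lm(input_ids=input_ids,
                                attention_mask=attention_mask).hidden_states
        if self.training:
            # Stochastic layer selection: sample one hidden layer per batch so the
            # NMT encoder learns to consume embeddings from any depth.
            layer = random.randrange(1, len(hidden_states))
        else:
            layer = len(hidden_states) - 1                 # last layer at inference
        return self.proj(hidden_states[layer])             # inputs for the NMT encoder

# Usage: the projected embeddings replace the NMT encoder's token embeddings
# (positional encoding omitted for brevity).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
batch = tokenizer(["ein kleines Beispiel"], return_tensors="pt", padding=True)
embedder = ContextualizedEncoderInput()
enc_inputs = embedder(batch["input_ids"], batch["attention_mask"])
nmt_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=6)
memory = nmt_encoder(enc_inputs, src_key_padding_mask=batch["attention_mask"].eq(0))
```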
Related papers
- Efficient Machine Translation with a BiLSTM-Attention Approach [0.0]
This paper proposes a novel Seq2Seq model aimed at improving translation quality while reducing the storage space required by the model.
The model employs a Bidirectional Long Short-Term Memory network (Bi-LSTM) as the encoder to capture the context information of the input sequence.
Compared to the current mainstream Transformer model, our model achieves superior performance on the WMT14 machine translation dataset.
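As a quick illustration of the encoder side described above, here is a generic Bi-LSTM encoder plus a dot-product attention helper; the dimensions and the attention form are assumptions made for the sketch, not details taken from the paper.

```python
# Generic sketch of a Bi-LSTM encoder with dot-product attention.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, src_ids):
        # Each position sees both left and right context via the bidirectional pass.
        outputs, _ = self.lstm(self.embed(src_ids))        # (B, S, 2*hidden)
        return outputs

def dot_product_attention(decoder_state, encoder_outputs):
    # decoder_state: (B, 2*hidden); encoder_outputs: (B, S, 2*hidden)
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)                # attention over source positions
    return torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)  # context vector
```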
arXiv Detail & Related papers (2024-10-29T01:12:50Z)
- A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models [27.777372498182864]
We propose a novel fine-tuning approach for Generative Large Language Models (LLMs)
Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data.
Using LLaMA-2 as the underlying model, our results show that it achieves an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance.
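The two-stage recipe lends itself to a compact sketch: continued next-token training on monolingual text, then fine-tuning on a small, high-quality parallel set. Everything below (checkpoint name, prompt format, hyperparameters) is an illustrative assumption rather than the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"         # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token        # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def batches(texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    # Causal-LM objective: labels are the inputs themselves
    # (padding left unmasked here only to keep the sketch short).
    return [(enc["input_ids"], enc["input_ids"])]

def train_on(data):
    model.train()
    for input_ids, labels in data:
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: continued training on monolingual target-language text.
train_on(batches(["Das ist ein Beispielsatz.", "Noch ein Satz."]))
# Stage 2: fine-tuning on a small set of high-quality parallel data,
# formatted as simple translation prompts (the prompt format is an assumption).
train_on(batches(["Translate to German: This is an example. => Das ist ein Beispiel."]))
```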
arXiv Detail & Related papers (2023-09-20T22:53:15Z)
- Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC [51.34222224728979]
This paper introduces a series of innovative techniques to enhance the translation quality of Non-Autoregressive Translation (NAT) models.
We propose fine-tuning Pretrained Multilingual Language Models (PMLMs) with the CTC loss to train NAT models effectively.
Our model exhibits a remarkable speed improvement of 16.35 times compared to the autoregressive model.
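To make the CTC-based NAT idea concrete, here is a rough sketch of attaching a CTC head to a pretrained multilingual encoder; the checkpoint, the 2x upsampling factor, and the vocabulary size are placeholders, not the paper's actual configuration.

```python
# Rough sketch: non-autoregressive translation head trained with CTC loss on top
# of a pretrained multilingual encoder.
import torch
import torch.nn as nn
from transformers import AutoModel

BLANK = 0                                    # CTC blank index (assumed; targets must avoid it)
encoder = AutoModel.from_pretrained("xlm-roberta-base")
proj = nn.Linear(encoder.config.hidden_size, 32000)   # target vocab size is a placeholder
ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)

def nat_ctc_loss(input_ids, attention_mask, target_ids, target_lengths):
    # target_ids: (B, S) padded target tokens; target_lengths: (B,) true lengths.
    hidden = encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    hidden = hidden.repeat_interleave(2, dim=1)              # upsample source length for CTC
    log_probs = proj(hidden).log_softmax(-1).transpose(0, 1) # (T, B, V), as CTCLoss expects
    input_lengths = attention_mask.sum(-1) * 2
    return ctc(log_probs, target_ids, input_lengths, target_lengths)
```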
arXiv Detail & Related papers (2023-06-10T05:24:29Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
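One common way to serve several translation tasks with a single model (not necessarily UMLNMT's actual mechanism, which the summary does not describe) is to prepend a task tag to each source sentence and mix the datasets. A minimal sketch of that data-side idea, with hypothetical tags and examples:

```python
import random

# Hypothetical mixed corpus: each task contributes (source, target) pairs.
TASKS = {
    "<doc>":  [("Document-level source ...", "Document-level target ...")],
    "<chat>": [("Chat-style source ...", "Chat-style target ...")],
    "<news>": [("News source ...", "News target ...")],
}

def unified_examples():
    # Prepend a task tag so one model can condition on the intended task.
    pool = [(f"{tag} {src}", tgt) for tag, pairs in TASKS.items() for src, tgt in pairs]
    random.shuffle(pool)
    return pool

for src, tgt in unified_examples():
    print(src, "->", tgt)
```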
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neighbor Machine Translation [48.58899349349702]
Nearest Neighbor Machine Translation (kNN-MT) is a simple and effective method of augmenting neural machine translation (NMT) with a token-level nearest neighbor retrieval mechanism.
In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT.
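The token-level retrieval mechanism that kNN-MT adds is easy to sketch: a datastore maps decoder hidden states to the target tokens that followed them, and at decoding time the retrieved neighbors are turned into a distribution that is interpolated with the NMT model's own prediction. The sketch below shows that interpolation with a brute-force datastore; PRED's specific way of building datastores from pre-trained models is not captured here.

```python
import torch

def knn_distribution(query, keys, values, vocab_size, k=4, temperature=10.0):
    # keys: (N, d) stored decoder states; values: (N,) target tokens that followed them.
    dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)      # L2 distance to all entries
    topk = dists.topk(k, largest=False)
    weights = torch.softmax(-topk.values / temperature, dim=-1)   # closer neighbors weigh more
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, values[topk.indices], weights)          # aggregate per target token
    return p_knn

def interpolate(p_nmt, p_knn, lam=0.5):
    # Final next-token distribution: mix retrieval and model probabilities.
    return lam * p_knn + (1.0 - lam) * p_nmt

# Toy usage with a random datastore.
vocab, dim = 8, 16
keys, values = torch.randn(100, dim), torch.randint(0, vocab, (100,))
p_nmt = torch.softmax(torch.randn(vocab), dim=-1)
p_final = interpolate(p_nmt, knn_distribution(torch.randn(dim), keys, values, vocab))
```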
arXiv Detail & Related papers (2022-12-17T08:34:20Z)
- Improving Neural Machine Translation by Bidirectional Training [85.64797317290349]
We present a simple and effective pretraining strategy -- bidirectional training (BiT) for neural machine translation.
Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally.
Experimental results show that BiT significantly improves state-of-the-art neural machine translation performance across 15 translation tasks on 8 language pairs.
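The bidirectional-training idea can be sketched on the data side: in the early stage every sentence pair is used in both directions (optionally marked with a direction tag), and the model then continues training on the original direction only. A minimal sketch, with the tag format as an assumption:

```python
def bidirectional_pairs(parallel_data, src_tag="<2de>", tgt_tag="<2en>"):
    """Duplicate each pair in both directions for the early training stage."""
    both = []
    for src, tgt in parallel_data:
        both.append((f"{src_tag} {src}", tgt))   # original direction, e.g. En->De
        both.append((f"{tgt_tag} {tgt}", src))   # reversed direction, e.g. De->En
    return both

data = [("a small example", "ein kleines Beispiel")]
early_stage = bidirectional_pairs(data)              # train on this first ...
late_stage = [(f"<2de> {s}", t) for s, t in data]    # ... then tune on the original direction
```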
arXiv Detail & Related papers (2021-09-16T07:58:33Z)
- Pronoun-Targeted Fine-tuning for NMT with Hybrid Losses [6.596002578395152]
We introduce a class of conditional generative-discriminative hybrid losses that we use to fine-tune a trained machine translation model.
We improve the model performance of both a sentence-level and a contextual model without using any additional data.
Our sentence-level model shows a 0.5 BLEU improvement on both the WMT14 and the IWSLT13 De-En test sets.
Our contextual model achieves the best results, improving from 31.81 to 32 BLEU on the WMT14 De-En test set, and from 32.10 to 33.13 on the IWSLT13 De-En test set.
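In its most generic form, a hybrid loss of this kind combines the usual token-level generative loss with a discriminative term and fine-tunes the already-trained model on the sum. The weighting and the particular discriminative signal below are assumptions made for illustration, not the losses defined in the paper.

```python
import torch
import torch.nn as nn

def hybrid_loss(lm_logits, target_ids, sentence_scores, sentence_labels, alpha=0.5):
    # Generative part: standard token-level cross-entropy of the translation model.
    gen = nn.functional.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), target_ids.view(-1))
    # Discriminative part: score correct vs. corrupted (e.g. wrong-pronoun) outputs.
    disc = nn.functional.binary_cross_entropy_with_logits(sentence_scores, sentence_labels)
    return gen + alpha * disc

# Toy shapes: batch of 2 sentences, 5 target tokens, vocab of 11.
logits = torch.randn(2, 5, 11)
targets = torch.randint(0, 11, (2, 5))
scores = torch.randn(2)                  # one discriminative score per sentence
labels = torch.tensor([1.0, 0.0])        # 1 = correct reference, 0 = corrupted variant
loss = hybrid_loss(logits, targets, scores, labels)
```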
arXiv Detail & Related papers (2020-10-15T10:11:40Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions in a diverse setting, covering low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
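mRASP's use of alignment information is usually described as random aligned substitution: during pre-training, source words are randomly replaced with dictionary translations in other languages so that related words across languages end up close in representation space. A heavily simplified sketch of that substitution step; the dictionary, probability, and whitespace tokenization are placeholders.

```python
import random

# Toy bilingual dictionary standing in for the real alignment lexicons.
LEXICON = {"hello": "bonjour", "world": "monde", "good": "bon"}

def random_aligned_substitution(tokens, prob=0.3):
    """Randomly swap source words for their dictionary translations."""
    return [LEXICON[t] if t in LEXICON and random.random() < prob else t
            for t in tokens]

print(random_aligned_substitution("hello world this is good".split()))
```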
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Abstractive Text Summarization based on Language Model Conditioning and Locality Modeling [4.525267347429154]
We train a Transformer-based neural model conditioned on the BERT language model.
In addition, we propose a new method of BERT-windowing, which allows chunk-wise processing of texts longer than the BERT window size.
The results of our models are compared to a baseline and the state-of-the-art models on the CNN/Daily Mail dataset.
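BERT-windowing, as summarized above, processes documents longer than BERT's maximum input by splitting them into overlapping windows and encoding each chunk separately; how the overlapping representations are merged afterwards is omitted here. A minimal chunking sketch, with the window and stride sizes as placeholders:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def windowed_encode(text, window=512, stride=256):
    """Encode a long text chunk-wise with overlapping BERT windows."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + window - 2] for i in range(0, max(len(ids), 1), stride)]
    outputs = []
    for chunk in chunks:
        chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
        hidden = model(torch.tensor([chunk])).last_hidden_state
        outputs.append(hidden)           # one representation per overlapping window
    return outputs
```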
arXiv Detail & Related papers (2020-03-29T14:00:17Z)