Incorporating BERT into Neural Machine Translation
- URL: http://arxiv.org/abs/2002.06823v1
- Date: Mon, 17 Feb 2020 08:13:36 GMT
- Title: Incorporating BERT into Neural Machine Translation
- Authors: Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou,
Houqiang Li and Tie-Yan Liu
- Abstract summary: We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
- Score: 251.54280200353674
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The recently proposed BERT has shown great power on a variety of natural
language understanding tasks, such as text classification, reading
comprehension, etc. However, how to effectively apply BERT to neural machine
translation (NMT) remains under-explored. While BERT is more commonly used for
fine-tuning than as a contextual embedding in downstream language understanding
tasks, our preliminary exploration shows that, for NMT, using BERT as a
contextual embedding works better than using it for fine-tuning. This motivates
us to explore how to better leverage BERT for NMT along this direction. We
propose a
new algorithm named BERT-fused model, in which we first use BERT to extract
representations for an input sequence, and then the representations are fused
with each layer of the encoder and decoder of the NMT model through attention
mechanisms. We conduct experiments on supervised (including sentence-level and
document-level translations), semi-supervised and unsupervised machine
translation, and achieve state-of-the-art results on seven benchmark datasets.
Our code is available at https://github.com/bert-nmt/bert-nmt.
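
For readers who want a concrete picture of the fusion step, the following is a minimal PyTorch sketch of one BERT-fused encoder layer: the layer attends both to its own states and to the representations extracted by BERT, then averages the two attention outputs. Module names and dimensions are illustrative assumptions, and details of the released fairseq implementation (such as the drop-net trick) are omitted.

```python
# Minimal sketch of a BERT-fused encoder layer (not the authors' fairseq code;
# see https://github.com/bert-nmt/bert-nmt for the real implementation).
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    def __init__(self, d_model: int, d_bert: int, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Attention from the NMT states (queries) into the BERT representations
        # (keys/values); kdim/vdim allow BERT to have a different hidden size.
        self.bert_attn = nn.MultiheadAttention(
            d_model, nhead, kdim=d_bert, vdim=d_bert, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, h_bert: torch.Tensor) -> torch.Tensor:
        # x:      (batch, src_len, d_model)  NMT encoder states
        # h_bert: (batch, bert_len, d_bert)  representations extracted by BERT
        self_out, _ = self.self_attn(x, x, x)
        bert_out, _ = self.bert_attn(x, h_bert, h_bert)
        x = self.norm1(x + 0.5 * (self_out + bert_out))  # fuse the two attention outputs
        x = self.norm2(x + self.ffn(x))
        return x
```

Decoder layers are fused in the same spirit, with an additional attention over the BERT representations alongside the usual encoder-decoder attention.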
Related papers
- PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with a Permuted Language Model (PerLM) objective.
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
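
As a rough illustration of the PerLM objective described above (the permutation ratio and label convention here are assumptions, not PERT's exact recipe), training examples could be built like this:

```python
# Illustrative construction of a PerLM training example: shuffle a proportion of
# the tokens and, at each shuffled slot, ask the model to predict the position
# the original token came from. Ratio and label convention are assumptions.
import random

def build_perlm_example(token_ids, permute_ratio=0.15, ignore_index=-100):
    n = len(token_ids)
    k = min(n, max(2, int(n * permute_ratio)))
    positions = sorted(random.sample(range(n), k))
    shuffled = positions[:]
    random.shuffle(shuffled)

    inputs = list(token_ids)
    labels = [ignore_index] * n          # only permuted slots contribute to the loss
    for dst, src in zip(positions, shuffled):
        inputs[dst] = token_ids[src]     # the token from position `src` now sits at `dst`
        labels[dst] = src                # target: index of its original position
    return inputs, labels
```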
arXiv Detail & Related papers (2022-03-14T07:58:34Z)
- BERT-DRE: BERT with Deep Recursive Encoder for Natural Language Sentence Matching [4.002351785644765]
This paper presents a deep neural architecture for Natural Language Sentence Matching (NLSM) that adds a deep recursive encoder to BERT.
Our analysis of model behavior shows that BERT still does not capture the full complexity of text.
On the religious dataset, BERT achieved an accuracy of 89.70%, and the BERT-DRE architecture improved this to 90.29% on the same dataset.
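
A hedged sketch of the general idea, a recurrent stack on top of BERT outputs for sentence matching, is shown below; the exact recursive encoder, pooling and classification head of BERT-DRE are not reproduced here.

```python
# Hedged sketch: a recurrent stack over BERT outputs for sentence matching.
# Depth, pooling, and head design of BERT-DRE are not reproduced here.
import torch
import torch.nn as nn
from transformers import AutoModel

class BertWithRecurrentEncoder(nn.Module):
    def __init__(self, hidden: int = 256, num_layers: int = 3, num_classes: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.rnn = nn.LSTM(self.bert.config.hidden_size, hidden,
                           num_layers=num_layers, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        out, _ = self.rnn(h)               # deep bidirectional recurrence over BERT states
        return self.classifier(out[:, 0])  # classify from the first ([CLS]) position
```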
arXiv Detail & Related papers (2021-11-03T12:56:13Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
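
A minimal, hedged sketch of budget-constrained selection over candidate sentences and phrases is given below; the scoring and cost functions are placeholders, not the paper's actual criteria.

```python
# Hedged sketch of budget-constrained selection of translation units (sentences
# and phrases). `score_fn` is a placeholder for an informativeness measure such
# as model uncertainty; the paper's actual criteria are not reproduced here.
def select_for_annotation(candidates, budget_tokens, score_fn):
    """Greedily pick the highest-scoring sentences/phrases until the budget is spent."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    chosen, spent = [], 0
    for unit in ranked:
        cost = len(unit.split())          # cost = number of source tokens to translate
        if spent + cost <= budget_tokens:
            chosen.append(unit)
            spent += cost
    return chosen
```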
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
- Better Neural Machine Translation by Extracting Linguistic Information from BERT [4.353029347463806]
Adding linguistic information to neural machine translation (NMT) has mostly focused on using point estimates from pre-trained models.
We augment NMT by extracting dense fine-tuned vector-based linguistic information from BERT instead of using point estimates.
arXiv Detail & Related papers (2021-04-07T00:03:51Z)
- BERT-JAM: Boosting BERT-Enhanced Neural Machine Translation with Joint Attention [9.366359346271567]
We propose a novel BERT-enhanced neural machine translation model called BERT-JAM.
BERT-JAM uses joint-attention modules to allow the encoder/decoder layers to dynamically allocate attention between different representations.
Our experiments show that BERT-JAM achieves SOTA BLEU scores on multiple translation tasks.
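
As a hedged illustration of dynamically allocating attention between representations (not BERT-JAM's published module), a learned gate can decide, per position, how much to rely on the NMT self-attention output versus the attention over BERT representations:

```python
# Hedged illustration of "dynamically allocating attention between different
# representations": a learned gate mixes the self-attention output with the
# attention over BERT representations. This is not BERT-JAM's published module.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, self_out: torch.Tensor, bert_out: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, len, d_model); the gate lies in (0, 1) per channel.
        g = torch.sigmoid(self.gate(torch.cat([self_out, bert_out], dim=-1)))
        return g * self_out + (1.0 - g) * bert_out
```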
arXiv Detail & Related papers (2020-11-09T09:30:37Z)
- Deep Clustering of Text Representations for Supervision-free Probing of Syntax [51.904014754864875]
We consider part of speech induction (POSI) and constituency labelling (CoLab) in this work.
We find that Multilingual BERT (mBERT) contains a surprising amount of syntactic knowledge of English.
We report competitive performance of our probe on 45-tag English POSI, state-of-the-art performance on 12-tag POSI across 10 languages, and competitive results on CoLab.
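
A simplified sketch of the underlying recipe, clustering contextual token representations to induce part-of-speech tags without supervision, is shown below; plain k-means stands in for the paper's deep clustering, and subword/word alignment is ignored.

```python
# Simplified POS-induction sketch: cluster contextual token vectors from mBERT.
# Plain k-means stands in for the paper's deep clustering; a real corpus is
# needed for meaningful clusters.
import torch
from sklearn.cluster import KMeans
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentences = ["The cat sat on the mat .", "Dogs bark loudly .", "She reads a long book ."]
vectors = []
with torch.no_grad():
    for s in sentences:
        enc = tok(s, return_tensors="pt")
        hidden = mbert(**enc).last_hidden_state[0, 1:-1]   # drop [CLS]/[SEP]
        vectors.append(hidden)
token_vecs = torch.cat(vectors).numpy()

# Induce unsupervised "POS" clusters; each cluster id acts as an induced tag.
n_clusters = min(12, len(token_vecs))
induced_tags = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(token_vecs)
```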
arXiv Detail & Related papers (2020-10-24T05:06:29Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
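
A hedged sketch of the mining step is shown below: sentences from both languages are embedded with multilingual BERT (mean pooling here is an assumption) and each source sentence is paired with its nearest target neighbor above a similarity threshold; margin-based scoring and the self-training loop are omitted.

```python
# Hedged sketch of pseudo-parallel mining with multilingual BERT embeddings.
# Mean pooling, a plain cosine threshold, and no self-training are simplifications.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(sentences):
    with torch.no_grad():
        enc = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        hidden = mbert(**enc).last_hidden_state              # (batch, len, dim)
        mask = enc["attention_mask"].unsqueeze(-1)
        vecs = (hidden * mask).sum(1) / mask.sum(1)           # mean over real tokens
    return F.normalize(vecs, dim=-1)

def mine_pseudo_parallel(src_sents, tgt_sents, threshold=0.7):
    src, tgt = embed(src_sents), embed(tgt_sents)
    sims = src @ tgt.T                                        # cosine similarities
    best = sims.argmax(dim=1)
    return [(src_sents[i], tgt_sents[j])
            for i, j in enumerate(best.tolist()) if sims[i, j] >= threshold]
```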
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
- Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, chosen to suit the bidirectional and conditionally independent nature of BERT.
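
A minimal sketch of the kind of lightweight bottleneck adapter referenced above is shown below; the bottleneck size and exact placement inside the frozen BERT layers are illustrative assumptions.

```python
# Sketch of a lightweight bottleneck adapter: a small down/up projection with a
# residual connection, trained while the surrounding BERT weights stay frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))   # residual keeps BERT's output intact

# Typical usage: freeze the pretrained encoder/decoder and train only the adapters.
# for p in bert.parameters():
#     p.requires_grad = False
```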
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
- CERT: Contrastive Self-supervised Learning for Language Understanding [20.17416958052909]
We propose CERT: Contrastive self-supervised Representations from Transformers.
CERT pretrains language representation models using contrastive self-supervised learning at the sentence level.
We evaluate CERT on 11 natural language understanding tasks in the GLUE benchmark where CERT outperforms BERT on 7 tasks, achieves the same performance as BERT on 2 tasks, and performs worse than BERT on 2 tasks.
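
A simplified sketch of sentence-level contrastive pretraining is shown below; it uses an in-batch NT-Xent-style loss as a stand-in for CERT's full training setup, with the two views assumed to come from augmentation such as back-translation.

```python
# Simplified sentence-level contrastive objective: embeddings of two augmented
# views of the same sentence are pulled together, while the other sentences in
# the batch act as negatives. This in-batch loss is an illustrative stand-in.
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    # z1, z2: (batch, dim) sentence embeddings of the two views
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = (z1 @ z2.T) / temperature          # similarity of every view-1 to every view-2
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # positives lie on the diagonal
```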
arXiv Detail & Related papers (2020-05-16T16:20:38Z)
- Cross-lingual Supervision Improves Unsupervised Neural Machine Translation [97.84871088440102]
We introduce a multilingual unsupervised NMT framework that transfers weakly supervised signals from high-resource language pairs to zero-resource translation directions.
The method improves translation quality by more than 3 BLEU points on six benchmark unsupervised translation directions.
arXiv Detail & Related papers (2020-04-07T05:46:49Z)