Improving Simultaneous Machine Translation with Monolingual Data
- URL: http://arxiv.org/abs/2212.01188v1
- Date: Fri, 2 Dec 2022 14:13:53 GMT
- Title: Improving Simultaneous Machine Translation with Monolingual Data
- Authors: Hexuan Deng, Liang Ding, Xuebo Liu, Meishan Zhang, Dacheng Tao, Min Zhang
- Abstract summary: Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT by training a SiMT student on a combination of bilingual data and external monolingual data distilled by Seq-KD.
- Score: 94.1085601198393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Simultaneous machine translation (SiMT) is usually done via sequence-level
knowledge distillation (Seq-KD) from a full-sentence neural machine translation
(NMT) model. However, there is still a significant performance gap between NMT
and SiMT. In this work, we propose to leverage monolingual data to improve SiMT
by training a SiMT student on a combination of bilingual data and external
monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh
and En-Ja news domain corpora demonstrate that monolingual data can
significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired
by the behavior of human simultaneous interpreters, we propose a novel
monolingual sampling strategy for SiMT, considering both chunk length and
monotonicity. Experimental results show that our sampling strategy consistently
outperforms random sampling (and other conventional NMT monolingual sampling
strategies) by avoiding hallucination, a key problem in SiMT, and that it scales
better. We achieve an average improvement of +0.72 BLEU over random sampling on
En-Zh and En-Ja. Data and code can be found at
https://github.com/hexuandeng/Mono4SiMT.
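Below is a minimal, hypothetical sketch of the pipeline the abstract describes: a full-sentence NMT teacher pseudo-labels external monolingual source sentences (Seq-KD), a simple scoring step stands in for the paper's chunk-length and monotonicity criteria, and the kept pairs would then be mixed with the original bilingual data to train the SiMT student. The names distill_monolingual, toy_score, select_for_simt, and dummy_teacher are illustrative placeholders, not the authors' released implementation.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DistilledPair:
    source: str
    target: str   # teacher's full-sentence translation (pseudo-label)
    score: float  # higher = more suitable for SiMT training


def distill_monolingual(
    mono_sources: List[str],
    teacher_translate: Callable[[str], str],
    score_fn: Callable[[str, str], float],
) -> List[DistilledPair]:
    """Pseudo-label monolingual sentences with a full-sentence teacher (Seq-KD)."""
    pairs = []
    for src in mono_sources:
        tgt = teacher_translate(src)
        pairs.append(DistilledPair(src, tgt, score_fn(src, tgt)))
    return pairs


def toy_score(src: str, tgt: str) -> float:
    """Toy stand-in for the chunk-length/monotonicity criteria: prefer pairs whose
    source and target lengths are similar. The paper scores alignment monotonicity
    and chunk lengths; this heuristic is only a placeholder."""
    ls, lt = len(src.split()), len(tgt.split())
    return 1.0 - abs(ls - lt) / max(ls, lt, 1)


def select_for_simt(pairs: List[DistilledPair], keep_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Keep the top-scoring distilled pairs; these would be mixed with the original
    bilingual data to train the SiMT student."""
    ranked = sorted(pairs, key=lambda p: p.score, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_ratio))]
    return [(p.source, p.target) for p in kept]


if __name__ == "__main__":
    def dummy_teacher(src: str) -> str:
        # Identity "translation" just to keep the sketch runnable end to end.
        return src

    mono = ["the meeting starts at nine", "markets rallied after the announcement"]
    distilled = distill_monolingual(mono, dummy_teacher, toy_score)
    print(select_for_simt(distilled))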
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation (MMT) aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages [25.33888871213517]
Unsupervised machine translation (UMT) exploits large amounts of monolingual data.
Self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them.
We show that including UMT techniques into SSNMT significantly outperforms SSNMT and UMT on all tested language pairs.
arXiv Detail & Related papers (2021-07-19T11:56:03Z)
- Alternated Training with Synthetic and Authentic Data for Neural Machine Translation [49.35605028467887]
We propose alternated training with synthetic and authentic data for neural machine translation (NMT).
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
arXiv Detail & Related papers (2021-06-16T07:13:16Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English→German and English→Chinese datasets demonstrate the effectiveness of the proposed approach. (A toy sketch of this dictionary-based uncertainty scoring appears after this list.)
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Translating the Unseen? Yorùbá → English MT in Low-Resource, Morphologically-Unmarked Settings [8.006185289499049]
Translating between languages where certain features are marked morphologically in one but absent or marked contextually in the other is an important test case for machine translation.
In this work, we perform a fine-grained analysis of how an SMT system compares with two NMT systems when translating bare nouns in Yorùbá into English.
arXiv Detail & Related papers (2021-03-07T01:24:09Z)
- Cross-lingual Supervision Improves Unsupervised Neural Machine Translation [97.84871088440102]
We introduce a multilingual unsupervised NMT framework to leverage weakly supervised signals from high-resource language pairs to zero-resource translation directions.
Our method significantly improves translation quality by more than 3 BLEU points on six benchmark unsupervised translation directions.
arXiv Detail & Related papers (2020-04-07T05:46:49Z)
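As a complement, here is a toy illustration of the dictionary-based uncertainty sampling idea summarised in the "Self-Training Sampling with Monolingual Data Uncertainty" entry above: each monolingual sentence is scored by the average translation entropy of its words under a bilingual dictionary extracted from parallel data. The dictionary format and scoring below are illustrative assumptions, not that paper's actual implementation.

import math
from typing import Dict

# Hypothetical dictionary: source word -> {target word: translation probability}
BilingualDict = Dict[str, Dict[str, float]]


def word_entropy(translations: Dict[str, float]) -> float:
    """Shannon entropy of a word's translation distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in translations.values() if p > 0)


def sentence_uncertainty(sentence: str, bi_dict: BilingualDict) -> float:
    """Average translation entropy over the words covered by the dictionary."""
    entropies = [word_entropy(bi_dict[w]) for w in sentence.split() if w in bi_dict]
    return sum(entropies) / len(entropies) if entropies else 0.0


if __name__ == "__main__":
    toy_dict: BilingualDict = {
        "bank": {"Bank": 0.5, "Ufer": 0.5},  # ambiguous -> high entropy
        "water": {"Wasser": 1.0},            # unambiguous -> zero entropy
    }
    for sent in ["the bank is closed", "water is clear"]:
        print(sent, "->", round(sentence_uncertainty(sent, toy_dict), 3))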