Improving Lexically Constrained Neural Machine Translation with
Source-Conditioned Masked Span Prediction
- URL: http://arxiv.org/abs/2105.05498v1
- Date: Wed, 12 May 2021 08:11:33 GMT
- Authors: Gyubok Lee, Seongjun Yang, Edward Choi
- Abstract summary: In this paper, we tackle a more challenging setup consisting of domain-specific corpora whose terms are much longer n-grams and highly specialized.
To encourage span-level representations in generation, we additionally impose a source-sentence conditioned masked span prediction loss in the decoder.
Experimental results on three domain-specific corpora in two language pairs demonstrate that the proposed training scheme can improve the performance of existing lexically constrained methods.
- Score: 6.46964825569749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating accurate terminology is a crucial component for the practicality
and reliability of neural machine translation (NMT) systems. To address this,
lexically constrained NMT explores various methods to ensure that pre-specified
words and phrases appear in the translations. In many cases, however, those
methods are evaluated on general domain corpora, where the terms are mostly
uni- and bi-grams (>98%). In this paper, we instead tackle a more challenging
setup: domain-specific corpora whose terms are much longer n-grams and highly
specialized. To encourage span-level representations during generation, we
additionally impose a source-sentence conditioned masked span prediction loss
in the decoder and observe improvements in both terminology translation and
BLEU scores. Experimental results on three domain-specific corpora in two
language pairs demonstrate that the proposed training scheme can improve the
performance of existing lexically constrained methods that can operate
with or without a term dictionary at test time.
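To make the idea concrete, here is a minimal PyTorch-style sketch of how a source-conditioned masked span prediction loss could be added on top of the usual translation objective. The `model.encode`/`model.decode` interface, the masking hyperparameters, and the unweighted sum of the two losses are illustrative assumptions, not the authors' exact recipe.

```python
# Sketch only: a span-masked copy of the target is fed to the decoder, which
# must reconstruct the masked tokens while attending to the encoded source.
import torch
import torch.nn.functional as F

MASK_ID, PAD_ID = 4, 0  # assumed special-token ids

def mask_spans(tgt, span_len=3, mask_ratio=0.15):
    """Replace random contiguous spans of target tokens with MASK_ID."""
    masked, is_masked = tgt.clone(), torch.zeros_like(tgt, dtype=torch.bool)
    n_spans = max(1, int(tgt.size(1) * mask_ratio / span_len))
    for b in range(tgt.size(0)):
        for _ in range(n_spans):
            s = torch.randint(0, max(1, tgt.size(1) - span_len), (1,)).item()
            masked[b, s:s + span_len] = MASK_ID
            is_masked[b, s:s + span_len] = True
    return masked, is_masked

def training_loss(model, src, tgt_in, tgt_out):
    """Translation loss plus source-conditioned masked span prediction."""
    memory = model.encode(src)              # encoder states of the source
    # 1) Standard autoregressive NMT loss.
    logits = model.decode(tgt_in, memory)   # (batch, time, vocab)
    nmt_loss = F.cross_entropy(logits.transpose(1, 2), tgt_out,
                               ignore_index=PAD_ID)
    # 2) Masked span prediction: reconstruct masked target tokens while
    #    cross-attending to the source (how the decoder attends within the
    #    masked target is a design choice left open in this sketch).
    masked_tgt, is_masked = mask_spans(tgt_in)
    span_logits = model.decode(masked_tgt, memory)
    span_loss = F.cross_entropy(span_logits[is_masked], tgt_in[is_masked])
    return nmt_loss + span_loss             # equal weighting is an assumption
```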
Related papers
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and can even incur heavy degradation.
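As a rough illustration of the operation being analyzed, the sketch below re-splits any BPE subword rarer than a frequency threshold into the pieces it was merged from; the data structures are assumptions, not that paper's implementation.

```python
# Sketch only: undo BPE merges for subwords below a frequency threshold.
from collections import Counter

def trim_vocabulary(tokenized_corpus, merges, min_freq):
    """merges maps a merged subword to the (left, right) pair it came from."""
    freq = Counter(tok for sent in tokenized_corpus for tok in sent)

    def explode(tok):
        # Keep frequent subwords; recursively re-split rare merged ones.
        if freq[tok] >= min_freq or tok not in merges:
            return [tok]
        left, right = merges[tok]
        return explode(left) + explode(right)

    return [[piece for tok in sent for piece in explode(tok)]
            for sent in tokenized_corpus]
```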
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Towards Effective Disambiguation for Machine Translation with Large Language Models [65.80775710657672]
We study the capabilities of large language models to translate "ambiguous sentences".
Experiments show that our methods can match or outperform state-of-the-art systems such as DeepL and NLLB in four out of five language directions.
arXiv Detail & Related papers (2023-09-20T22:22:52Z)
- Learning Homographic Disambiguation Representation for Neural Machine Translation [20.242134720005467]
Homographs, words with the same spelling but different meanings, remain challenging for Neural Machine Translation (NMT).
We propose a novel approach that tackles these issues of NMT in the latent space.
We first train an encoder (aka "homographic encoder") to learn universal sentence representations on a natural language inference (NLI) task.
We further fine-tune the encoder using homograph-based WordNet synsets, enabling it to learn word-set representations from sentences.
arXiv Detail & Related papers (2023-04-12T13:42:59Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Contextualized Semantic Distance between Highly Overlapped Texts [85.1541170468617]
Overlapping frequently occurs in paired texts in natural language processing tasks like text editing and semantic similarity evaluation.
This paper aims to address the issue with a mask-and-predict strategy.
We take the words in the longest common sequence as neighboring words and use masked language modeling (MLM) to predict the distributions at their positions.
Experiments on Semantic Textual Similarity show the resulting neighboring distribution divergence (NDD) to be more sensitive to various semantic differences, especially on highly overlapped paired texts.
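A rough sketch of that mask-and-predict strategy, assuming an off-the-shelf masked language model from Hugging Face `transformers`; the single-subword masking and the KL-based comparison are illustrative simplifications.

```python
# Sketch only: mask a shared word in each text, read off the MLM's
# distribution at that position, and compare the two distributions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def masked_log_probs(words, i):
    """Log-probabilities the MLM predicts when words[i] is masked."""
    masked = words.copy()
    masked[i] = tok.mask_token
    enc = tok(" ".join(masked), return_tensors="pt")
    pos = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        return torch.log_softmax(mlm(**enc).logits[0, pos], dim=-1)

def neighboring_divergence(words_a, words_b, aligned_idx):
    """Sum KL(P_a || P_b) over aligned shared-word positions."""
    total = 0.0
    for ia, ib in aligned_idx:  # index pairs of words in the common sequence
        p, q = masked_log_probs(words_a, ia), masked_log_probs(words_b, ib)
        total += F.kl_div(q, p, log_target=True, reduction="sum").item()
    return total
```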
arXiv Detail & Related papers (2021-10-04T03:59:15Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
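For intuition, a toy version of budgeted selection might rank unlabeled sentences and phrases by a model-uncertainty score and greedily fill a token budget for human translation; the scoring function and the greedy policy here are assumptions, not that paper's method.

```python
# Sketch only: greedy budget-constrained selection of translation candidates.
def select_for_annotation(sentences, phrases, score, budget_tokens):
    """sentences/phrases: lists of token lists; score: higher = more uncertain."""
    pool = sentences + phrases
    chosen, spent = [], 0
    for item in sorted(pool, key=score, reverse=True):
        if spent + len(item) <= budget_tokens:
            chosen.append(item)
            spent += len(item)
    return chosen
```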
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
- Sentence Alignment with Parallel Documents Helps Biomedical Machine Translation [0.5430741734728369]
This work presents a new unsupervised sentence alignment method and explores features in training biomedical neural machine translation (NMT) systems.
We use a simple but effective way to build bilingual word embeddings to evaluate bilingual word similarity.
The proposed method achieved high accuracy in both 1-to-1 and many-to-many cases.
arXiv Detail & Related papers (2021-04-17T16:09:30Z)
- Decoding Time Lexical Domain Adaptation for Neural Machine Translation [7.628949147902029]
Machine translation systems are vulnerable to domain mismatch, especially when the task is low-resource.
We present two simple methods for improving translation quality in this particular setting.
arXiv Detail & Related papers (2021-01-02T11:06:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.