Sentence Concatenation Approach to Data Augmentation for Neural Machine
Translation
- URL: http://arxiv.org/abs/2104.08478v1
- Date: Sat, 17 Apr 2021 08:04:42 GMT
- Title: Sentence Concatenation Approach to Data Augmentation for Neural Machine
Translation
- Authors: Seiichiro Kondo and Kengo Hotate and Masahiro Kaneko and Mamoru
Komachi
- Abstract summary: This study proposes a simple data augmentation method to handle long sentences.
We use only the given parallel corpora as the training data and generate long sentences by concatenating two sentences.
Translation quality is further improved when the proposed method is combined with back-translation.
- Score: 22.316934668106526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural machine translation (NMT) has recently gained widespread attention
because of its high translation accuracy. However, it shows poor performance in
the translation of long sentences, which is a major issue in low-resource
languages. It is assumed that this issue is caused by an insufficient number of
long sentences in the training data. Therefore, this study proposes a simple
data augmentation method to handle long sentences. In this method, we use only
the given parallel corpora as the training data and generate long sentences by
concatenating two sentences. Based on the experimental results, we confirm
improvements in long sentence translation by the proposed data augmentation
method, despite its simplicity. Moreover, translation quality improves further
when the proposed method is combined with back-translation.
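The core idea of the abstract can be sketched in a few lines: sample two sentence pairs from the given parallel corpus and concatenate them on both the source and target side to create longer synthetic training examples. The paper does not specify the exact sampling scheme, so the random pairing and the `concat_augment` helper below are hypothetical illustrations, not the authors' implementation.

```python
import random

def concat_augment(src_sents, tgt_sents, num_pairs, seed=0):
    """Generate synthetic long sentence pairs by concatenating two
    randomly chosen parallel sentence pairs (hypothetical sketch)."""
    assert len(src_sents) == len(tgt_sents)
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_pairs):
        # Pick two (possibly identical) indices into the parallel corpus.
        i = rng.randrange(len(src_sents))
        j = rng.randrange(len(src_sents))
        # Join source sides and target sides in the same order, so the
        # synthetic pair remains a valid translation.
        augmented.append((src_sents[i] + " " + src_sents[j],
                          tgt_sents[i] + " " + tgt_sents[j]))
    return augmented

src = ["ich bin müde .", "das ist gut ."]
tgt = ["i am tired .", "that is good ."]
extra = concat_augment(src, tgt, num_pairs=2)
```

The synthetic pairs would then simply be appended to the original training data; only the given parallel corpus is used, consistent with the abstract.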
Related papers
- Simplifying Translations for Children: Iterative Simplification Considering Age of Acquisition with LLMs [19.023628411128406]
We propose a method that replaces words with a high Age of Acquisition (AoA) in translations with simpler words to match the translations to the user's level.
The experimental results obtained from the dataset show that our method effectively replaces high-AoA words with lower-AoA words.
arXiv Detail & Related papers (2024-08-08T04:57:36Z) - Crossing the Threshold: Idiomatic Machine Translation through Retrieval
Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z) - DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z) - Monotonic Simultaneous Translation with Chunk-wise Reordering and
Refinement [38.89496608319392]
We propose an algorithm to reorder and refine the target side of a full sentence translation corpus.
The words/phrases between the source and target sentences are aligned largely monotonically, using word alignment and non-autoregressive neural machine translation.
The proposed approach improves BLEU scores and resulting translations exhibit enhanced monotonicity with source sentences.
arXiv Detail & Related papers (2021-10-18T22:51:21Z) - Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z) - On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z) - Self-Training Sampling with Monolingual Data Uncertainty for Neural
Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English$\Rightarrow$German and English$\Rightarrow$Chinese datasets demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-02T05:01:36Z) - Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering the words' roles in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z) - Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised
Neural Machine Translation [5.958653653305609]
We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences.
This automatically expands the vocabulary of the model while maintaining high-quality content.
arXiv Detail & Related papers (2020-04-05T02:14:14Z)
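The dictionary-based approach in the last entry can be illustrated with a minimal sketch: translate monolingual sentences word by word through a bilingual dictionary, copying out-of-dictionary tokens unchanged. The `word_by_word_translate` helper and the toy German-English dictionary below are hypothetical, assumed for illustration only.

```python
def word_by_word_translate(sentence, bilingual_dict):
    """Translate each whitespace token via the dictionary; copy through
    out-of-dictionary tokens (e.g. names, numbers) unchanged."""
    return " ".join(bilingual_dict.get(tok, tok) for tok in sentence.split())

# Toy dictionary for illustration.
de_en = {"das": "the", "haus": "house", "ist": "is", "alt": "alt" and "old"}
word_by_word_translate("das haus ist alt", de_en)  # -> "the house is old"
```

Such word-by-word outputs are noisy translations, but as additional synthetic training pairs they can expand the model's vocabulary in low-resource settings, as the entry describes.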
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.