Synthetic Source Language Augmentation for Colloquial Neural Machine Translation
- URL: http://arxiv.org/abs/2012.15178v1
- Date: Wed, 30 Dec 2020 14:52:15 GMT
- Title: Synthetic Source Language Augmentation for Colloquial Neural Machine Translation
- Authors: Asrul Sani Ariesandy, Mukhlis Amien, Alham Fikri Aji, Radityo Eko Prasojo
- Abstract summary: We develop a novel colloquial Indonesian-English test set collected from YouTube transcripts and Twitter.
We apply synthetic style augmentation to formal Indonesian source sentences and show that it improves the baseline Id-En models.
- Score: 3.303435360096988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural machine translation (NMT) is typically domain- and style-dependent, and it requires large amounts of training data. State-of-the-art NMT models often fall short in handling colloquial variations of their source language, and the lack of parallel data in this regard is a challenging hurdle to systematically improving existing models. In this work, we develop a novel colloquial Indonesian-English test set collected from YouTube transcripts and Twitter. We apply synthetic style augmentation to formal Indonesian source sentences and show that it improves the baseline Id-En models (in BLEU) on the new test data.
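The paper itself does not include code; purely as a minimal illustrative sketch, one common way to synthesize colloquial Indonesian from formal text is dictionary-based lexical substitution. The substitution table, probability, and function names below are assumptions, not the authors' actual pipeline.

```python
import random

# Illustrative formal -> colloquial Indonesian substitutions; a real system
# would use a much larger mapping (and possibly spelling perturbations).
FORMAL_TO_COLLOQUIAL = {
    "tidak": "gak",
    "sudah": "udah",
    "saja": "aja",
    "kalau": "kalo",
    "bagaimana": "gimana",
}

def augment_style(sentence, p=0.7):
    """Replace formal tokens with colloquial variants with probability p."""
    out = []
    for tok in sentence.split():
        key = tok.lower()
        if key in FORMAL_TO_COLLOQUIAL and random.random() < p:
            out.append(FORMAL_TO_COLLOQUIAL[key])
        else:
            out.append(tok)
    return " ".join(out)

# The English target side stays unchanged, so each augmented pair can be
# added directly to the parallel training data.
print(augment_style("saya tidak tahu kalau kamu sudah pulang"))
```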
Related papers
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
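Imit-MNMT's full procedure is more involved than any short snippet; purely as a rough illustration of matching an expert model's output distribution, a generic distillation-style imitation loss might look like the following (all names and shapes here are assumptions, not the paper's method):

```python
import torch
import torch.nn.functional as F

def imitation_loss(student_logits, expert_logits, temperature=1.0):
    """KL divergence pushing the student's next-token distribution
    toward the expert's (a generic imitation/distillation objective)."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    e = F.softmax(expert_logits / temperature, dim=-1)
    return F.kl_div(s, e, reduction="batchmean") * temperature**2

# Usage: logits of shape (batch * seq_len, vocab) from both models.
student = torch.randn(8, 32000)
expert = torch.randn(8, 32000)
print(imitation_loss(student, expert).item())
```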
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
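CsaNMT defines the adjacency region in a learned semantic space; the toy sketch below only conveys the general idea of sampling variants around a sentence representation, with a Gaussian neighborhood standing in for the paper's learned region (an assumption, not the actual method):

```python
import torch

def sample_adjacent(sentence_embedding, radius=0.1, n_samples=4):
    """Draw n_samples points from a Gaussian ball around a sentence
    embedding, standing in for 'adequate variants of literal expression
    under the same meaning'."""
    noise = torch.randn(n_samples, *sentence_embedding.shape) * radius
    return sentence_embedding.unsqueeze(0) + noise

emb = torch.randn(512)           # encoder-side sentence representation
variants = sample_adjacent(emb)  # shape: (4, 512)
print(variants.shape)
```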
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT [64.1841519527504]
Neural machine translation uses a single neural network to model the entire translation process.
Despite NMT being the de facto standard, it is still not clear how NMT models acquire different competences over the course of training.
arXiv Detail & Related papers (2021-09-03T09:38:50Z)
- Alternated Training with Synthetic and Authentic Data for Neural Machine Translation [49.35605028467887]
We propose alternated training with synthetic and authentic data for neural machine translation (NMT).
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
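A minimal sketch of one possible alternation schedule, with placeholder data loaders and a placeholder training step (not the paper's code):

```python
def alternated_training(model, authentic_data, synthetic_data,
                        train_step, rounds=10):
    """Each round trains on a synthetic pass, then an authentic pass,
    so authentic data repeatedly pulls the model back from noise
    introduced by the synthetic examples."""
    for _ in range(rounds):
        for batch in synthetic_data:
            train_step(model, batch)
        for batch in authentic_data:
            train_step(model, batch)

# Dummy demo: "training" just counts batches.
counts = {"steps": 0}
alternated_training(
    model=None,
    authentic_data=[["auth"]] * 3,
    synthetic_data=[["syn"]] * 5,
    train_step=lambda m, b: counts.__setitem__("steps", counts["steps"] + 1),
    rounds=2,
)
print(counts["steps"])  # 16 = 2 * (5 + 3)
```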
arXiv Detail & Related papers (2021-06-16T07:13:16Z)
- PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents [40.25277134147149]
We present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation.
Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.
arXiv Detail & Related papers (2020-11-04T04:44:47Z)
- Enhanced back-translation for low resource neural machine translation using self-training [0.0]
This work proposes a self-training strategy where the output of the backward model is used to improve the model itself through the forward translation technique.
The technique was shown to improve baseline low-resource IWSLT'14 English-German and IWSLT'15 English-Vietnamese backward translation models by 11.06 and 1.5 BLEU points, respectively.
The synthetic data generated by the improved English-German backward model was used to train a forward model which outperformed another forward model trained using standard back-translation by 2.7 BLEU.
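A highly simplified sketch of the described self-training step, with every function a placeholder (the paper's actual procedure involves full NMT training runs not shown here):

```python
def self_training_back_translation(backward_model, mono_target,
                                   translate, train):
    """The backward (target -> source) model translates monolingual
    target-side text into synthetic source text; the resulting
    (synthetic source, authentic target) pairs are then used as
    forward-direction training data to improve the backward model
    itself, before standard back-translation for the forward model."""
    synthetic_source = [translate(backward_model, t) for t in mono_target]
    synthetic_pairs = list(zip(synthetic_source, mono_target))
    return train(backward_model, synthetic_pairs)

# Dummy demo with strings standing in for real models and sentences.
improved = self_training_back_translation(
    backward_model="backward-v0",
    mono_target=["ein Satz", "noch ein Satz"],
    translate=lambda m, t: f"src({t})",
    train=lambda m, pairs: f"{m}+trained-on-{len(pairs)}-pairs",
)
print(improved)  # backward-v0+trained-on-2-pairs
```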
arXiv Detail & Related papers (2020-06-04T14:19:52Z)
- An In-depth Walkthrough on Evolution of Neural Machine Translation [0.0]
This paper studies the major trends in Neural Machine Translation and the state-of-the-art models in the domain, and provides a high-level comparison between them.
arXiv Detail & Related papers (2020-04-10T04:21:05Z)
- Learning Contextualized Sentence Representations for Document-Level Neural Machine Translation [59.191079800436114]
Document-level machine translation incorporates inter-sentential dependencies into the translation of a source sentence.
We propose a new framework to model cross-sentence dependencies by training neural machine translation (NMT) to predict both the target translation and surrounding sentences of a source sentence.
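As a rough illustration of the multi-task idea (assumed names and weighting, not the paper's implementation), the joint objective combines the translation loss with auxiliary losses for predicting the surrounding sentences:

```python
import torch

def doc_level_loss(translation_loss, prev_sentence_loss,
                   next_sentence_loss, aux_weight=0.5):
    """Joint objective: translate the current sentence while also
    predicting its neighbors, forcing cross-sentence context into the
    learned representations."""
    return translation_loss + aux_weight * (prev_sentence_loss
                                            + next_sentence_loss)

# Example with scalar stand-ins for the three cross-entropy terms.
loss = doc_level_loss(torch.tensor(2.3), torch.tensor(3.1), torch.tensor(2.9))
print(loss.item())  # 2.3 + 0.5 * 6.0 = 5.3
```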
arXiv Detail & Related papers (2020-03-30T03:38:01Z)