Multiple Segmentations of Thai Sentences for Neural Machine Translation
- URL: http://arxiv.org/abs/2004.11472v1
- Date: Thu, 23 Apr 2020 21:48:58 GMT
- Title: Multiple Segmentations of Thai Sentences for Neural Machine Translation
- Authors: Alberto Poncelas, Wichaya Pidchamook, Chao-Hong Liu, James Hadley,
Andy Way
- Abstract summary: We show how to augment a set of English--Thai parallel data by replicating sentence-pairs with different word segmentation methods on Thai.
Experiments show that, by combining these datasets, performance is improved for NMT models trained with a dataset that has been split using a supervised splitting tool.
- Score: 6.1335228645093265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thai is a low-resource language, so it is often the case that data is not
available in sufficient quantities to train a Neural Machine Translation (NMT)
model which performs to a high level of quality. In addition, the Thai script
does not use white spaces to delimit the boundaries between words, which adds
more complexity when building sequence-to-sequence models. In this work, we
explore how to augment a set of English--Thai parallel data by replicating
sentence-pairs with different word segmentation methods on Thai, as training
data for NMT models. Using different merge operations of Byte Pair
Encoding, different segmentations of Thai sentences can be obtained. The
experiments show that, by combining these datasets, performance is improved for
NMT models trained with a dataset that has been split using a supervised
splitting tool.
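The augmentation idea lends itself to a short sketch: train several BPE models with different numbers of merge operations and replicate each sentence pair once per resulting segmentation of the Thai side. Below is a minimal illustration using the sentencepiece library; the corpus path, vocabulary sizes and other settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the augmentation described above: obtain several BPE
# segmentations of the Thai side by varying the number of merge operations
# (approximated here via SentencePiece vocabulary sizes), then replicate
# each English--Thai pair once per segmentation. Paths and sizes are
# illustrative assumptions.
import sentencepiece as spm

THAI_CORPUS = "train.th"           # one raw Thai sentence per line (assumed)
VOCAB_SIZES = [1000, 4000, 16000]  # stand-ins for different merge operations

def train_segmenters(corpus, sizes):
    segmenters = []
    for size in sizes:
        prefix = f"th_bpe_{size}"
        spm.SentencePieceTrainer.train(
            input=corpus,
            model_prefix=prefix,
            vocab_size=size,
            model_type="bpe",
            character_coverage=0.9995,  # keep rare Thai characters
        )
        segmenters.append(spm.SentencePieceProcessor(model_file=f"{prefix}.model"))
    return segmenters

def augment(pairs, segmenters):
    """Yield each (en, th) pair once per segmentation of the Thai side."""
    for en, th in pairs:
        for sp in segmenters:
            yield en, " ".join(sp.encode(th, out_type=str))
```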
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- TAMS: Translation-Assisted Morphological Segmentation [3.666125285899499]
We present a sequence-to-sequence model for canonical morpheme segmentation.
Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data.
While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
arXiv Detail & Related papers (2024-03-21T21:23:35Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), which works with data from different tasks.
UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
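As a toy illustration of the masked-input setup this summary describes, the sketch below masks a random subset of tokens on the encoder side and keeps the original sequence as the decoder's reconstruction target; the mask rate and token handling are assumptions for illustration, not the paper's exact recipe.

```python
# Toy sketch of MLM adapted to a sequence-to-sequence setup: random tokens
# in the encoder input are masked, and the decoder reconstructs the original
# sentence. The 15% mask rate is an illustrative assumption.
import random

MASK = "<mask>"

def make_mlm_example(tokens, rate=0.15, rng=random.Random(0)):
    encoder_input = [MASK if rng.random() < rate else tok for tok in tokens]
    decoder_target = list(tokens)  # target is the full original sequence
    return encoder_input, decoder_target

enc_in, dec_out = make_mlm_example("unsupervised pretraining helps translation".split())
print(enc_in, dec_out)
```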
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
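The first step of this pipeline, generating pseudo-parallel data from a back-translation model, can be sketched as below; `backtranslate` is a hypothetical stand-in for any pretrained target-to-source MT model, and the meta-learning adaptation itself is not shown.

```python
# Sketch of pseudo-parallel data generation via back-translation.
# `backtranslate` is a hypothetical target->source translation function;
# the paper's meta-learning update of the back-translation model is omitted.
def generate_pseudo_parallel(target_sentences, backtranslate):
    # Pair each monolingual target sentence with a synthetic source sentence,
    # yielding (source, target) pairs to train the forward model on.
    return [(backtranslate(t), t) for t in target_sentences]
```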
arXiv Detail & Related papers (2021-02-15T20:58:32Z)
- WangchanBERTa: Pretraining transformer-based Thai Language Models [2.186960190193067]
We pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size).
We apply text processing rules that are specific to Thai, most importantly preserving spaces.
We also experiment with word-level, syllable-level and SentencePiece tokenization on a smaller dataset to explore the effects of tokenization on downstream performance.
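The tokenization comparison can be pictured by segmenting the same Thai sentence at the word and subword levels, as in the sketch below; PyThaiNLP's newmm engine and the SentencePiece model path are illustrative assumptions rather than WangchanBERTa's actual tokenizers.

```python
# Sketch: the same Thai sentence under word-level and SentencePiece
# tokenization. The SentencePiece model path is hypothetical.
import sentencepiece as spm
from pythainlp.tokenize import word_tokenize

sentence = "ฉันชอบอ่านหนังสือ"  # "I like reading books"

# Word-level: dictionary-based maximal matching (PyThaiNLP's default engine).
print(word_tokenize(sentence, engine="newmm"))

# Subword-level: a trained SentencePiece model (hypothetical path).
sp = spm.SentencePieceProcessor(model_file="th_spm.model")
print(sp.encode(sentence, out_type=str))
```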
arXiv Detail & Related papers (2021-01-24T03:06:34Z)
- A Corpus for English-Japanese Multimodal Neural Machine Translation with Comparable Sentences [21.43163704217968]
We propose a new multimodal English-Japanese corpus with comparable sentences that are compiled from existing image captioning datasets.
Due to low translation scores in our baseline experiments, we believe that current multimodal NMT models are not designed to effectively utilize comparable sentence data.
arXiv Detail & Related papers (2020-10-17T06:12:25Z)