Data Augmentation and Terminology Integration for Domain-Specific
Sinhala-English-Tamil Statistical Machine Translation
- URL: http://arxiv.org/abs/2011.02821v3
- Date: Wed, 3 Feb 2021 06:13:27 GMT
- Title: Data Augmentation and Terminology Integration for Domain-Specific
Sinhala-English-Tamil Statistical Machine Translation
- Authors: Aloka Fernando, Surangika Ranathunga, Gihan Dias
- Abstract summary: Out-of-vocabulary (OOV) words are a problem for Machine Translation (MT) in low-resourced languages.
This paper focuses on data augmentation techniques where bilingual lexicon terms are expanded based on case-markers.
- Score: 1.1470070927586016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Out-of-vocabulary (OOV) words are a problem for Machine Translation (MT) in low-resourced languages, and the problem becomes even worse when the source and/or target languages are morphologically rich. Bilingual list integration is one approach to addressing the OOV problem, since it allows more words to be translated than appear in the training data. However, because bilingual lists contain words in their base form, inflected forms of morphologically rich languages such as Sinhala and Tamil remain untranslated. This paper focuses on data augmentation techniques in which bilingual lexicon terms are expanded based on case-markers, with the objective of generating new words for use in Statistical Machine Translation (SMT). This data augmentation technique for dictionary terms yields improved BLEU scores for Sinhala-English SMT.
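To make the augmentation idea concrete, here is a minimal sketch, assuming a base-form bilingual lexicon and a small case-marker inventory; the romanized suffixes, target patterns, and example entries are illustrative placeholders, not the paper's actual lexicon or marker set.

```python
# Minimal sketch of case-marker-based lexicon expansion (illustrative data only).

def expand_lexicon(lexicon, case_markers):
    """lexicon: {source_base_form: target_base_form};
    case_markers: list of (source_suffix, target_pattern) pairs."""
    augmented = dict(lexicon)  # keep the original base-form pairs
    for src_base, tgt_base in lexicon.items():
        for suffix, tgt_pattern in case_markers:
            # Attach the case-marker to the source word and render an English
            # equivalent (typically a prepositional phrase) on the target side.
            augmented[src_base + suffix] = tgt_pattern.format(tgt_base)
    return augmented

# Hypothetical romanized Sinhala-English entries and markers (illustration only).
lexicon = {"potha": "book", "gasa": "tree"}
case_markers = [
    ("ta", "to the {}"),    # dative-like marker
    ("en", "from the {}"),  # ablative-like marker
    ("ge", "of the {}"),    # genitive-like marker
]

for source, target in sorted(expand_lexicon(lexicon, case_markers).items()):
    print(source, "->", target)
```

The expanded pairs would then be merged into the bilingual list (or appended to the parallel training data) used by the SMT system, so that inflected surface forms also receive translations.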
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation [14.826948179996695]
Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models.
We propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions.
Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios.
arXiv Detail & Related papers (2024-05-29T17:19:04Z)
- Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching [9.435669487585917]
We focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script.
We present a novel dataset annotated with word-level etymology.
We show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.
arXiv Detail & Related papers (2024-01-30T11:04:36Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Dict-NMT: Bilingual Dictionary based NMT for Extremely Low Resource Languages [1.8787713898828164]
We present a detailed analysis of the effects of the quality of dictionaries, training dataset size, language family, etc., on the translation quality.
Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines.
arXiv Detail & Related papers (2022-06-09T12:03:29Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that performance suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language (a minimal sketch follows this entry).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
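As a rough illustration of the self-teaching idea in the entry above, the sketch below computes a KL-divergence loss between auto-generated soft pseudo-labels and the model's predictions for translated target-language text; the function and tensor names and the temperature are assumptions for illustration, not FILTER's actual implementation.

```python
import torch.nn.functional as F

def self_teaching_loss(student_logits, pseudo_label_logits, temperature=1.0):
    """KL(pseudo-labels || student predictions) on translated target-language text."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Soft pseudo-labels are treated as fixed targets (no gradient flows through them).
    p_pseudo = F.softmax(pseudo_label_logits.detach() / temperature, dim=-1)
    return F.kl_div(log_p_student, p_pseudo, reduction="batchmean")

# Usage sketch: add to the supervised loss computed on source-language labels.
# total_loss = supervised_loss + kl_weight * self_teaching_loss(target_logits, pseudo_logits)
```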
- Morphological Word Segmentation on Agglutinative Languages for Neural Machine Translation [8.87546236839959]
We propose a source-side morphological word segmentation method for Neural Machine Translation (NMT).
It incorporates morphology knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time.
It can be utilized as a preprocessing tool to segment the words in agglutinative languages for other natural language processing (NLP) tasks.
arXiv Detail & Related papers (2020-01-02T10:05:02Z)
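As a generic illustration of source-side morphological segmentation as a preprocessing step (not the specific method of the entry above), the sketch below splits words into a stem plus marked suffixes; the toy suffix list and boundary marker are assumptions, and a real pipeline would use a proper morphological analyzer for the language in question.

```python
# Generic morphological-segmentation preprocessing sketch (toy data, illustration only).

TOY_SUFFIXES = ["lar", "ler", "da", "de"]  # illustrative agglutinative suffixes
MARKER = "@@"                              # marks a morpheme attached to the stem

def segment_word(word):
    """Iteratively strip known suffixes from the end of a word, outermost first."""
    morphemes = []
    stripped = True
    while stripped:
        stripped = False
        for suffix in TOY_SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                morphemes.insert(0, MARKER + suffix)
                word = word[: -len(suffix)]
                stripped = True
                break
    return [word] + morphemes

def segment_sentence(sentence):
    return " ".join(token for word in sentence.split() for token in segment_word(word))

print(segment_sentence("evlerde kitaplar"))  # -> "ev @@ler @@de kitap @@lar"
```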
This list is automatically generated from the titles and abstracts of the papers on this site.