A Tulu Resource for Machine Translation
- URL: http://arxiv.org/abs/2403.19142v1
- Date: Thu, 28 Mar 2024 04:30:07 GMT
- Title: A Tulu Resource for Machine Translation
- Authors: Manu Narayanan, Noëmi Aepli
- Abstract summary: We present the first parallel dataset for English-Tulu translation.
Tulu is spoken by approximately 2.5 million individuals in southwestern India.
Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points.
- Score: 3.038642416291856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first parallel dataset for English-Tulu translation. Tulu, classified within the South Dravidian linguistic family branch, is predominantly spoken by approximately 2.5 million individuals in southwestern India. Our dataset is constructed by integrating human translations into the multilingual machine translation resource FLORES-200. Furthermore, we use this dataset for evaluation purposes in developing our English-Tulu machine translation model. For the model's training, we leverage resources available for related South Dravidian languages. We adopt a transfer learning approach that exploits similarities between high-resource and low-resource languages. This method enables the training of a machine translation system even in the absence of parallel data between the source and target language, thereby overcoming a significant obstacle in machine translation development for low-resource languages. Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points (in September 2023). The dataset and code are available here: https://github.com/manunarayanan/Tulu-NMT.
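The 19-point margin over Google Translate is measured in BLEU, the standard n-gram overlap metric for machine translation. As a minimal illustration of how BLEU is computed (a simplified sentence-level sketch only; published comparisons such as this one normally use corpus-level sacreBLEU, not this code):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU (0-100) with brevity penalty.
    Illustrative only; real evaluations use corpus-level sacreBLEU."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # clipped n-gram matches against the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # epsilon smoothing so a single zero count does not zero the score
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

A gap of 19 BLEU points on this 0-100 scale is a very large difference in translation quality.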
Related papers
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Extremely low-resource machine translation for closely related languages [0.0]
This work focuses on closely related languages from the Uralic language family: Estonian and Finnish.
We find that multilingual learning and synthetic corpora increase the translation quality in every language pair.
We show that transfer learning and fine-tuning are very effective for doing low-resource machine translation and achieve the best results.
arXiv Detail & Related papers (2021-05-27T11:27:06Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- scb-mt-en-th-2020: A Large English-Thai Parallel Corpus [3.3072037841206354]
We construct an English-Thai machine translation dataset with over 1 million segment pairs.
We train machine translation models based on this dataset.
The dataset, pre-trained models, and source code to reproduce our work are available for public use.
arXiv Detail & Related papers (2020-07-07T15:14:32Z)
- An Augmented Translation Technique for low Resource language pair: Sanskrit to Hindi translation [0.0]
In this work, Zero Shot Translation (ZST) is investigated for a low-resource language pair.
The same architecture is tested for Sanskrit to Hindi translation for which data is sparse.
Dimensionality reduction of word embedding is performed to reduce the memory usage for data storage.
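The dimensionality reduction of word embeddings described above is commonly done with a truncated PCA/SVD projection; the abstract does not specify the exact method, so the following is only an illustrative sketch on a random toy embedding matrix:

```python
import numpy as np

def reduce_embeddings(emb, k):
    """Project word embeddings onto their top-k principal components
    to shrink storage. Illustrative sketch; the paper's exact
    reduction method is not specified in the abstract."""
    centered = emb - emb.mean(axis=0)
    # SVD of the centered matrix: rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T  # shape (vocab, k) instead of (vocab, d)

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 300))  # toy 300-d embeddings for 1000 words
small = reduce_embeddings(emb, 64)      # 64-d embeddings: ~4.7x less storage
```

Keeping only the leading components trades a small loss in embedding fidelity for a proportional reduction in memory.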
arXiv Detail & Related papers (2020-06-09T17:01:55Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach to converting text into a different language without human involvement.
In this paper, we apply NMT to two morphologically rich Indian languages, Tamil and Malayalam, i.e. the English-Tamil and English-Malayalam pairs.
We propose a novel NMT model using multi-head self-attention along with pre-trained Byte-Pair-Encoded (BPE) and MultiBPE embeddings to build an efficient translation system.
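The BPE embeddings mentioned above rest on the byte-pair-encoding algorithm, which repeatedly merges the most frequent adjacent symbol pair in a word-frequency vocabulary. A toy merge learner (illustrative only; the paper uses pre-trained BPE/MultiBPE embeddings, not this code) can be sketched as:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict.
    Toy sketch of the standard algorithm, starting from characters."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # rewrite every word with the chosen pair fused into one symbol
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```

Because frequent subwords become single symbols, BPE lets a fixed vocabulary cover the rich morphology of languages like Tamil and Malayalam without an explosion of rare whole-word types.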
arXiv Detail & Related papers (2020-04-19T17:29:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.