University of Cape Town's WMT22 System: Multilingual Machine Translation for Southern African Languages
- URL: http://arxiv.org/abs/2210.11757v1
- Date: Fri, 21 Oct 2022 06:31:24 GMT
- Authors: Khalid N. Elmadani, Francois Meyer, Jan Buys
- Abstract summary: Our system is a single multilingual translation model that translates between English and 8 South / South East African languages.
We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training.
Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paper describes the University of Cape Town's submission to the
constrained track of the WMT22 Shared Task: Large-Scale Machine Translation
Evaluation for African Languages. Our system is a single multilingual
translation model that translates between English and 8 South / South East
African languages, as well as between specific pairs of the African languages.
We used several techniques suited for low-resource machine translation (MT),
including overlap BPE, back-translation, synthetic training data generation,
and adding more translation directions during training. Our results show the
value of these techniques, especially for directions where very little or no
bilingual training data is available.
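The back-translation and synthetic-data steps mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's released code: `translate_target_to_source` is a hypothetical stand-in for a trained reverse-direction model, and the `<2xxx>` language tag follows the common convention for steering a single multilingual model toward a target language.

```python
# Minimal sketch of back-translation for low-resource MT:
# monolingual target-side sentences are translated into the source
# language by a reverse-direction model, yielding synthetic
# (source, target) training pairs.

def translate_target_to_source(sentence: str) -> str:
    """Hypothetical reverse-direction model (target -> source).
    A real system would invoke a trained NMT model here."""
    return "<synthetic translation of: " + sentence + ">"

def back_translate(mono_target: list[str], tgt_lang: str) -> list[tuple[str, str]]:
    """Build synthetic parallel data from monolingual target text."""
    pairs = []
    for tgt_sentence in mono_target:
        src_sentence = translate_target_to_source(tgt_sentence)
        # Prefix a target-language tag so a single multilingual model
        # knows which direction it is being asked to translate.
        tagged_src = f"<2{tgt_lang}> {src_sentence}"
        pairs.append((tagged_src, tgt_sentence))
    return pairs

synthetic = back_translate(["Molo lizwe", "Sawubona mhlaba"], "xho")
print(synthetic[0])
```

The resulting synthetic pairs are simply concatenated with the genuine bilingual data during training, which is what makes the technique valuable for directions with little or no parallel text.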
Related papers
- Retrieval-Augmented Machine Translation with Unstructured Knowledge [74.84236945680503]
Retrieval-augmented generation (RAG) introduces additional information to enhance large language models (LLMs).
In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs.
In this paper, we study retrieval-augmented MT using unstructured documents.
arXiv Detail & Related papers (2024-12-05T17:00:32Z)
- Toucan: Many-to-Many Translation for 150 African Language Pairs [18.994098153839996]
We introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively.
Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs.
Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages.
arXiv Detail & Related papers (2024-03-28T04:30:07Z)
- A Tulu Resource for Machine Translation [3.038642416291856]
We present the first parallel dataset for English-Tulu translation.
Tulu is spoken by approximately 2.5 million individuals in southwestern India.
Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points.
arXiv Detail & Related papers (2023-05-24T12:00:24Z)
- Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions [68.01449013641532]
Large-scale pretrained language models (LLMs) have shown strong abilities in multilingual translation.
We present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7B, to perform multilingual translation.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-18T07:22:29Z)
- Tencent's Multilingual Machine Translation System for WMT22 Large-Scale African Languages [47.06332023467713]
This paper describes Tencent's multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages.
We adopt data augmentation, distributionally robust optimization, and language family grouping to develop our multilingual neural machine translation (MNMT) models.
arXiv Detail & Related papers (2022-06-30T02:18:15Z)
- Building Multilingual Machine Translation Systems That Serve Arbitrary X-Y Translations [75.73028056136778]
We show how to practically build MNMT systems that serve arbitrary X-Y translation directions.
We also examine our proposed approach in an extremely large-scale data setting to accommodate practical deployment scenarios.
arXiv Detail & Related papers (2022-04-08T21:42:44Z)
- MMTAfrica: Multilingual Machine Translation for African Languages [0.010742675209112621]
We introduce MMTAfrica, the first many-to-many multilingual translation system for six African languages.
For multilingual translation among African languages, we introduce a novel back-translation and reconstruction objective, BT&REC.
We report improvements from MMTAfrica on the FLORES 101 benchmark.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won the first place on English to Chinese, Polish to English, and German to Upper Sorbian translation directions.
arXiv Detail & Related papers (2020-04-19T17:29:34Z)
- Neural Machine Translation for Low-Resourced Indian Languages [4.726777092009554]
Machine translation is an effective approach to convert text to a different language without any human involvement.
In this paper, we apply NMT to two language pairs involving morphologically rich Indian languages: English-Tamil and English-Malayalam.
We propose a novel NMT model that combines multi-head self-attention with pre-trained Byte Pair Encoding (BPE) and MultiBPE embeddings to build an efficient translation system.
arXiv Detail & Related papers (2020-04-19T17:29:34Z)
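Several entries above rely on BPE subword segmentation (overlap BPE in the main paper, pre-trained BPE embeddings here). The core of BPE is a single repeated operation: count adjacent symbol pairs across the vocabulary and merge the most frequent one. A minimal, illustrative sketch of one merge step (not any paper's actual implementation; real tools apply merges with symbol-boundary-aware matching rather than plain string replacement):

```python
from collections import Counter

def most_frequent_pair(words: dict[str, int]) -> tuple[str, str]:
    """Count adjacent symbol pairs across a vocabulary of
    space-separated symbol sequences, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words: dict[str, int], pair: tuple[str, str]) -> dict[str, int]:
    """Apply one BPE merge: fuse the chosen symbol pair everywhere.
    Plain str.replace is a simplification sufficient for this sketch."""
    merged = " ".join(pair)
    replacement = "".join(pair)
    return {word.replace(merged, replacement): freq for word, freq in words.items()}

# Character-level start; "</w>" marks the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "l o w e s t </w>": 3}
pair = most_frequent_pair(vocab)   # ('l', 'o'), occurring 10 times
vocab = merge_pair(vocab, pair)    # 'l o' fused into the symbol 'lo'
```

Iterating this loop a few thousand times yields the merge table that defines the subword vocabulary; "overlap BPE" variants bias these counts so that related languages share more subwords.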
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.