NLIP_Lab-IITH Low-Resource MT System for WMT24 Indic MT Shared Task
- URL: http://arxiv.org/abs/2410.03215v1
- Date: Fri, 4 Oct 2024 08:02:43 GMT
- Title: NLIP_Lab-IITH Low-Resource MT System for WMT24 Indic MT Shared Task
- Authors: Pramit Sahoo, Maharaj Brahma, Maunendra Sankar Desarkar
- Abstract summary: We describe our system for the WMT 24 shared task of Low-Resource Indic Language Translation.
Our primary system is based on language-specific finetuning on a pre-trained model.
We achieve chrF2 scores of 50.6, 42.3, 54.9, and 66.3 on the official public test set for eng$\rightarrow$as, eng$\rightarrow$kha, eng$\rightarrow$lus, eng$\rightarrow$mni respectively.
- Score: 9.476463361600826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we describe our system for the WMT 24 shared task of Low-Resource Indic Language Translation. We consider eng $\leftrightarrow$ {as, kha, lus, mni} as participating language pairs. In this shared task, we explore the finetuning of a pre-trained model motivated by the pre-trained objective of aligning embeddings closer by alignment augmentation \cite{lin-etal-2020-pre} for 22 scheduled Indian languages. Our primary system is based on language-specific finetuning on a pre-trained model. We achieve chrF2 scores of 50.6, 42.3, 54.9, and 66.3 on the official public test set for eng$\rightarrow$as, eng$\rightarrow$kha, eng$\rightarrow$lus, eng$\rightarrow$mni respectively. We also explore multilingual training with/without language grouping and layer-freezing. Our code, models, and generated translations are available here: https://github.com/pramitsahoo/WMT2024-LRILT.
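The chrF2 metric reported above is a character n-gram F-score with beta = 2. Below is a minimal sketch of computing it with sacrebleu, whose CHRF metric defaults to beta = 2; the example sentences are placeholders, not the shared-task data.

```python
# Minimal sketch: corpus-level chrF2 with sacrebleu.
from sacrebleu.metrics import CHRF

# sacrebleu's CHRF defaults to char_order=6, beta=2, i.e. chrF2.
chrf = CHRF()

hypotheses = ["a system translation of the first sentence"]       # placeholder output
references = [["a reference translation of the first sentence"]]  # one reference stream
print(chrf.corpus_score(hypotheses, references))  # e.g. "chrF2 = 57.64"
```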
Related papers
- NLIP_Lab-IITH Multilingual MT System for WAT24 MT Shared Task [9.476463361600826]
This paper describes NLIP Lab's multilingual machine translation system for the WAT24 shared task on multilingual Indic MT.
We explore pre-training for Indic languages using alignment agreement objectives.
We fine-tuned language direction-specific multilingual translation models using small and high-quality seed data.
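A minimal sketch of what direction-specific finetuning on a small seed corpus can look like with Hugging Face transformers; the checkpoint name and the seed pair below are placeholders, since the authors finetune their own pre-trained model.

```python
# Sketch: finetuning a seq2seq checkpoint on seed parallel data for one
# translation direction. Checkpoint and data are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "Helsinki-NLP/opus-mt-en-hi"  # stand-in pre-trained model
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

seed_pairs = [("an English sentence", "its target-language translation")]

model.train()
for src, tgt in seed_pairs:  # in practice: batched, over multiple epochs
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss  # cross-entropy on the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```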
arXiv Detail & Related papers (2024-10-17T11:18:23Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
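The core move, transliterating text into the high-resource language's script so that related words share surface forms, can be illustrated with a toy character table; the table below is illustrative only and is not the paper's transliteration scheme.

```python
# Toy illustration: map Cyrillic characters onto Latin so that cognates
# across scripts end up with overlapping surface forms (and subwords).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "е": "e", "к": "k", "м": "m", "о": "o", "т": "t",
}

def transliterate(text: str) -> str:
    """Replace covered characters; leave everything else unchanged."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)

print(transliterate("мама"))  # -> "mama"
```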
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- UMBCLU at SemEval-2024 Task 1A and 1C: Semantic Textual Relatedness with and without machine translation [0.09208007322096534]
The aim of SemEval-2024 Task 1 is to develop models for identifying semantic textual relatedness between two sentences.
We develop two STR models, $\textit{TranSem}$ and $\textit{FineSem}$, for the supervised and cross-lingual settings.
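As an illustration of scoring semantic textual relatedness with sentence embeddings: the sketch below uses a common public checkpoint, not necessarily what TranSem or FineSem use.

```python
# Sketch: semantic relatedness as cosine similarity of sentence embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder
a = "A man is playing a guitar."
b = "Someone is strumming an instrument."

emb = model.encode([a, b], convert_to_tensor=True)
relatedness = util.cos_sim(emb[0], emb[1]).item()  # in [-1, 1]
print(f"relatedness = {relatedness:.3f}")
```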
arXiv Detail & Related papers (2024-02-20T05:46:29Z)
- TSMind: Alibaba and Soochow University's Submission to the WMT22 Translation Suggestion Task [16.986003476984965]
This paper describes the joint submission of Alibaba and Soochow University, TSMind, to the WMT 2022 Shared Task on Translation Suggestion.
We adopt the paradigm of fine-tuning large-scale pre-trained models on the downstream task.
Considering the task's condition of limited use of training data, we follow the data augmentation strategies proposed by WeTS to boost our TS model performance.
arXiv Detail & Related papers (2022-11-16T15:43:31Z)
- Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English$\Leftrightarrow$Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
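For reference, translating with off-the-shelf M2M100 looks like the sketch below; Livonian is not in M2M100's original language inventory (that is the gap the system's adaptation techniques address), so English-to-German stands in.

```python
# Sketch: translation with M2M100; the target language is forced via a
# language tag as the first decoder token.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name)

tokenizer.src_lang = "en"
encoded = tokenizer("Low-resource translation is hard.", return_tensors="pt")
out = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("de"))
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```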
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
- CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task [0.0]
We show that using a joint model for multiple similar language pairs improves translation quality in each pair.
We also demonstrate that character-level bilingual models are competitive for very similar language pairs.
arXiv Detail & Related papers (2021-09-20T08:10:39Z)
- Emergent Communication Pretraining for Few-Shot Machine Translation [66.48990742411033]
We pretrain neural networks via emergent communication from referential games.
Our key assumption is that grounding communication on images, as a crude approximation of real-world environments, inductively biases the model towards learning natural languages.
arXiv Detail & Related papers (2020-11-02T10:57:53Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English-to-Chinese, Polish-to-English, and German-to-Upper-Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions in diverse settings, covering low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
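One ingredient of mRASP is random aligned substitution (RAS), in which source words are randomly replaced with dictionary translations so that cross-lingual synonyms appear in shared contexts during pre-training. A toy sketch follows; the three-entry dictionary and the substitution rate are illustrative, not mRASP's actual resources.

```python
# Toy sketch of random aligned substitution (RAS).
import random

EN_FR = {"dog": "chien", "house": "maison", "water": "eau"}

def ras(tokens, rate=0.3, seed=0):
    """Randomly swap covered source words for their dictionary translations."""
    rng = random.Random(seed)
    return [EN_FR[t] if t in EN_FR and rng.random() < rate else t
            for t in tokens]

print(ras("the dog drinks water near the house".split()))
```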
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) on English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing [11.564158965143418]
We frame the task of machine translation evaluation as one of scoring machine translation output with a sequence-to-sequence paraphraser.
We propose training the paraphraser as a multilingual NMT system, treating paraphrasing as a zero-shot translation task.
Our method is simple and intuitive, and does not require human judgements for training.
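A minimal sketch of the scoring idea: run a multilingual NMT model in same-language, zero-shot mode as a paraphraser, and score a system output by its forced-decoding likelihood given the reference. M2M100 stands in here for the paraphraser the paper actually trains.

```python
# Sketch: paraphrase-based MT scoring via forced decoding. M2M100 is a
# stand-in for the paper's multilingual paraphraser.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name)

tokenizer.src_lang = "en"
tokenizer.tgt_lang = "en"  # same language on both sides = paraphrasing

reference = "The cat sat on the mat."
hypothesis = "A cat was sitting on the mat."

batch = tokenizer(reference, text_target=hypothesis, return_tensors="pt")
with torch.no_grad():
    loss = model(**batch).loss  # mean NLL of the hypothesis tokens
print(f"score = {-loss.item():.3f}")  # higher (less negative) is better
```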
arXiv Detail & Related papers (2020-04-30T03:32:34Z)