NLIP_Lab-IITH Multilingual MT System for WAT24 MT Shared Task
- URL: http://arxiv.org/abs/2410.13443v1
- Date: Thu, 17 Oct 2024 11:18:23 GMT
- Title: NLIP_Lab-IITH Multilingual MT System for WAT24 MT Shared Task
- Authors: Maharaj Brahma, Pramit Sahoo, Maunendra Sankar Desarkar
- Abstract summary: This paper describes NLIP Lab's multilingual machine translation system for the WAT24 shared task on multilingual Indic MT.
We explore pre-training for Indic languages using alignment agreement objectives.
We fine-tuned language direction-specific multilingual translation models using small and high-quality seed data.
- Abstract: This paper describes NLIP Lab's multilingual machine translation system for the WAT24 shared task on multilingual Indic MT, covering 22 scheduled languages belonging to 4 language families. We explore pre-training for Indic languages using alignment agreement objectives, utilizing bilingual dictionaries to substitute words in source sentences. Furthermore, we fine-tuned language direction-specific multilingual translation models using small, high-quality seed data. Our primary submission is a 243M-parameter multilingual translation model covering 22 Indic languages. On the IN22-Gen benchmark, we achieved an average chrF++ score of 46.80 and a BLEU score of 18.19 in the En-Indic direction; in the Indic-En direction, we achieved an average chrF++ score of 56.34 and a BLEU score of 30.82. On the IN22-Conv benchmark, we achieved an average chrF++ score of 43.43 and a BLEU score of 16.58 in the En-Indic direction, and averages of 52.44 chrF++ and 29.77 BLEU in the Indic-En direction. Our model (code and models available at https://github.com/maharajbrahma/WAT2024-MultiIndicMT) is competitive with IndicTransv1 (a 474M-parameter model).
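The summary does not spell out the substitution procedure, so the sketch below only illustrates the general idea of bilingual-dictionary word substitution for pre-training data; the dictionary contents, substitution probability, and whitespace tokenization are all illustrative assumptions, not the paper's exact recipe.

```python
import random

def substitute_with_dictionary(sentence, bilingual_dict, sub_prob=0.3, seed=None):
    """Replace source words with dictionary translations at random.

    Generic sketch of dictionary-based code-switching augmentation;
    the dictionary, probability, and whitespace tokenization are
    illustrative assumptions, not the paper's exact procedure.
    """
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        translations = bilingual_dict.get(word.lower())
        if translations and rng.random() < sub_prob:
            out.append(rng.choice(translations))
        else:
            out.append(word)
    return " ".join(out)

# Toy English->Hindi dictionary (illustrative entries only).
en_hi = {"water": ["पानी"], "house": ["घर"], "book": ["किताब"]}
print(substitute_with_dictionary("The book is in the house", en_hi, seed=0))
```

Substituted sentences of this kind are commonly paired with alignment-style objectives, so that the encoder maps a word and its dictionary translation to nearby representations.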
Related papers
- NLIP_Lab-IITH Low-Resource MT System for WMT24 Indic MT Shared Task [9.476463361600826]
We describe our system for the WMT24 shared task on Low-Resource Indic Language Translation.
Our primary system is based on language-specific finetuning on a pre-trained model.
We achieve chrF2 scores of 50.6, 42.3, 54.9, and 66.3 on the official public test set for eng→as, eng→kha, eng→lus, and eng→mni respectively.
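The chrF2 numbers above (and the chrF++/BLEU numbers in the main abstract) are standard sacrebleu metrics; a minimal local scoring sketch follows, with toy hypothesis/reference strings standing in for real system output.

```python
from sacrebleu.metrics import BLEU, CHRF

hyps = ["the house is near the river"]        # toy system output
refs = [["the house is beside the river"]]    # one reference stream

chrf2 = CHRF()               # default beta=2, word_order=0 -> chrF2
chrfpp = CHRF(word_order=2)  # adds word n-grams -> chrF++
bleu = BLEU()

print(chrf2.corpus_score(hyps, refs))
print(chrfpp.corpus_score(hyps, refs))
print(bleu.corpus_score(hyps, refs))
```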
arXiv Detail & Related papers (2024-10-04T08:02:43Z)
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
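The summary does not give the exact prompt format; below is an illustrative sketch of few-shot prompting with fuzzy matches, using Python's difflib to rank translation-memory entries by similarity to the input. The toy memory and template are assumptions, and the resulting prompt would then be sent to GPT-3.5 or another LLM.

```python
from difflib import SequenceMatcher

def fuzzy_few_shot_prompt(src, memory, k=2):
    """Build a few-shot prompt from the k closest fuzzy-matched TM entries.

    `memory` holds (source, target) pairs; the similarity measure and
    prompt template are illustrative, not the paper's exact setup.
    """
    ranked = sorted(
        memory,
        key=lambda pair: SequenceMatcher(None, src, pair[0]).ratio(),
        reverse=True,
    )
    blocks = [f"Source: {s}\nTranslation: {t}" for s, t in ranked[:k]]
    blocks.append(f"Source: {src}\nTranslation:")
    return "\n\n".join(blocks)

# Toy English->German memory (stand-in for an English->Ge'ez TM).
memory = [
    ("the market opens early", "der Markt öffnet früh"),
    ("the market closes late", "der Markt schließt spät"),
    ("rain fell all night", "es regnete die ganze Nacht"),
]
print(fuzzy_few_shot_prompt("the market opens at dawn", memory))
```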
arXiv Detail & Related papers (2023-11-24T14:55:23Z)
- Assessing Translation capabilities of Large Language Models involving English and Indian Languages [4.067706269490143]
We explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages.
We fine-tune these large language models using parameter efficient fine-tuning methods such as LoRA and additionally with full fine-tuning.
Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as CHRF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively.
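As a concrete picture of the parameter-efficient route mentioned above: LoRA freezes the base model and trains small low-rank adapter matrices. A minimal sketch with the Hugging Face peft library follows; the checkpoint name, rank, and target modules are illustrative choices, not the paper's reported configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Base checkpoint is a placeholder; the paper's exact models may differ.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                      # low-rank dimension (illustrative)
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # prints the small trainable fraction
```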
arXiv Detail & Related papers (2023-11-15T18:58:19Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
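The construction is roughly: sample exemplars from several high-resource languages and concatenate them ahead of the low-resource input. A toy sketch under those assumptions (the pools, sampling scheme, and template are illustrative, not the paper's exact prompt format):

```python
import random

def diverse_prompt(exemplar_pools, query, n=2, seed=0):
    """Assemble exemplars sampled from several high-resource languages.

    `exemplar_pools` maps a language name to (source, english) pairs;
    pools and template are illustrative assumptions.
    """
    rng = random.Random(seed)
    langs = rng.sample(sorted(exemplar_pools), k=min(n, len(exemplar_pools)))
    parts = []
    for lang in langs:
        src, eng = rng.choice(exemplar_pools[lang])
        parts.append(f"{lang}: {src}\nEnglish: {eng}")
    parts.append(f"Input: {query}\nEnglish:")
    return "\n\n".join(parts)

pools = {
    "French": [("le chat dort", "the cat sleeps")],
    "German": [("der Hund läuft", "the dog runs")],
    "Spanish": [("la casa es grande", "the house is big")],
}
print(diverse_prompt(pools, "a toy low-resource sentence"))
```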
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages [37.758476568195256]
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people.
22 of these languages are listed in the Constitution of India (referred to as scheduled languages).
arXiv Detail & Related papers (2023-05-25T17:57:43Z)
- Tencent AI Lab - Shanghai Jiao Tong University Low-Resource Translation System for the WMT22 Translation Task [49.916963624249355]
This paper describes Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) Low-Resource Translation systems for the WMT22 shared task.
We participate in the general translation task on English↔Livonian.
Our system is based on M2M100 with novel techniques that adapt it to the target language pair.
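For context on the base system: M2M100 selects the output language by forcing a language token as the first generated token. A minimal translation sketch with Hugging Face transformers follows; since Livonian is not among M2M100's 100 languages (hence the paper's adaptation work), Estonian serves as a stand-in target here.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Public 418M M2M100 release; the paper's adapted Livonian system
# is not reproduced here.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"
encoded = tokenizer("The weather is fine today.", return_tensors="pt")

# Force the target-language BOS token; "et" (Estonian) stands in for
# Livonian, which M2M100 does not cover.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("et")
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```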
arXiv Detail & Related papers (2022-10-17T04:34:09Z)
- MIA 2022 Shared Task: Evaluating Cross-lingual Open-Retrieval Question Answering for 16 Diverse Languages [54.002969723086075]
We evaluate cross-lingual open-retrieval question answering systems in 16 typologically diverse languages.
The best system leveraging iteratively mined diverse negative examples achieves 32.2 F1, outperforming our baseline by 4.5 points.
The second best system uses entity-aware contextualized representations for document retrieval, and achieves significant improvements in Tamil (20.8 F1), whereas most of the other systems yield nearly zero scores.
arXiv Detail & Related papers (2022-07-02T06:54:10Z)
- Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages [4.3857077920223295]
Samanantar is the largest publicly available parallel corpora collection for Indic languages.
The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages.
arXiv Detail & Related papers (2021-04-12T16:18:20Z)
- Unsupervised Transfer Learning in Multilingual Neural Machine Translation with Cross-Lingual Word Embeddings [72.69253034282035]
We exploit a language-independent multilingual sentence representation to easily generalize to a new language.
Blindly decoding from Portuguese using a base system containing several Romance languages, we achieve scores of 36.4 BLEU for Portuguese-English and 12.8 BLEU for Russian-English.
We explore a more practical adaptation approach through non-iterative backtranslation, exploiting our model's ability to produce high quality translations.
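Non-iterative backtranslation here means a single pass: translate monolingual target-side text into the source language once, then train on the resulting synthetic pairs. A generic sketch follows; `translate_fn` is a placeholder for any trained reverse-direction model, not a component named in the paper.

```python
def backtranslate_once(mono_target_sentences, translate_fn):
    """Single-pass (non-iterative) backtranslation.

    `translate_fn` maps a target-language sentence to a synthetic
    source-language sentence (e.g., a trained target->source model);
    it is a placeholder, not a specific model from the paper.
    Returns (synthetic_source, real_target) training pairs.
    """
    pairs = []
    for tgt in mono_target_sentences:
        synthetic_src = translate_fn(tgt)
        pairs.append((synthetic_src, tgt))
    return pairs

# Toy stand-in "model": reverses word order, just to make this runnable.
toy_reverse_model = lambda s: " ".join(reversed(s.split()))
print(backtranslate_once(["a small example sentence"], toy_reverse_model))
```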
arXiv Detail & Related papers (2021-03-11T14:22:08Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
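AMBER's precise word- and sentence-level objectives are defined in the paper itself; as a flavor of explicit sentence-level alignment, the sketch below computes a generic InfoNCE-style contrastive loss that pulls parallel sentence embeddings together (an illustrative stand-in, not AMBER's exact formulation).

```python
import torch
import torch.nn.functional as F

def sentence_alignment_loss(src_emb, tgt_emb, temperature=0.05):
    """InfoNCE-style loss over a batch of parallel sentence embeddings.

    src_emb, tgt_emb: (batch, dim) tensors where row i of each side is
    a translation pair. Generic illustration, not AMBER's objective.
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature      # pairwise similarities
    labels = torch.arange(src.size(0))      # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

# Toy usage with random embeddings.
loss = sentence_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```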
arXiv Detail & Related papers (2020-10-15T18:34:13Z)