Gui at MixMT 2022 : English-Hinglish: An MT approach for translation of code mixed data
- URL: http://arxiv.org/abs/2210.12215v1
- Date: Fri, 21 Oct 2022 19:48:18 GMT
- Title: Gui at MixMT 2022 : English-Hinglish: An MT approach for translation of code mixed data
- Authors: Akshat Gahoi, Jayant Duneja, Anshul Padhi, Shivam Mangale, Saransh Rajput, Tanvi Kamble, Dipti Misra Sharma, Vasudeva Varma
- Abstract summary: We tackle code-mixed machine translation in both directions: English + Hindi to Hinglish, and Hinglish to English.
To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation.
- Score: 13.187116325089951
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Code-mixed machine translation has become an important task in multilingual
communities, and extending machine translation to code-mixed data is now common
for these languages. In the shared tasks of WMT 2022, we tackle both directions:
English + Hindi to Hinglish, and Hinglish to English. The first task dealt with
both Roman and Devanagari scripts, as we had monolingual data in both English
and Hindi, whereas the second task only had data in Roman script. To our
knowledge, we achieved one of the top ROUGE-L and WER scores for the first task
of Monolingual to Code-Mixed machine translation. In this paper, we discuss in
detail the use of mBART with special pre-processing and post-processing
(transliteration from Devanagari to Roman, sketched below) for the first task,
and the experiments that we performed for the second task of translating
code-mixed Hinglish to monolingual English.
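As a rough, illustrative sketch of the kind of pipeline the abstract describes (not the authors' exact system), the snippet below runs an off-the-shelf mBART-50 checkpoint for English-to-Hindi translation, then transliterates the Devanagari output to Roman script as a post-processing step. The checkpoint name and the ITRANS transliteration scheme are assumptions for illustration.

```python
# Minimal sketch: mBART translation with Devanagari->Roman transliteration
# as post-processing. The checkpoint and the ITRANS scheme are illustrative
# assumptions, not the paper's confirmed configuration.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

MODEL = "facebook/mbart-large-50-many-to-many-mmt"  # assumed checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(MODEL)
model = MBartForConditionalGeneration.from_pretrained(MODEL)

def english_to_romanized_hindi(text: str) -> str:
    """Translate English to Hindi with mBART, then romanize the output."""
    tokenizer.src_lang = "en_XX"
    batch = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **batch, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"]
    )
    devanagari = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
    # Post-processing step named in the abstract: Devanagari -> Roman.
    return transliterate(devanagari, sanscript.DEVANAGARI, sanscript.ITRANS)

print(english_to_romanized_hindi("How are you?"))
```

A real Hinglish system would fine-tune on code-mixed parallel data and would typically romanize only the Hindi tokens of a mixed output; romanizing the whole translation, as above, is a simplification.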
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
On the CoLI-Kenglish dataset, the proposed model achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61 (the gap between these two averages is illustrated below).
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
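Since the entry above reports both a weighted and a macro F1-score, a small sketch may help show how the two averages diverge under class imbalance, which is typical for word-level language identification. The labels below are invented for illustration.

```python
# Toy comparison of weighted vs. macro F1 for word-level language ID
# (kn = Kannada, en = English, mix = mixed). Labels are invented.
from sklearn.metrics import f1_score

y_true = ["kn", "kn", "kn", "kn", "kn", "kn", "en", "en", "mix", "mix"]
y_pred = ["kn", "kn", "kn", "kn", "kn", "en", "en", "en", "kn", "kn"]

# Weighted F1 is dominated by the frequent "kn" class; macro F1 weights
# every class equally, so the rare "mix" class (all wrong) drags it down.
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # ~0.62
print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # ~0.52
```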
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that normalizing Amharic homophone characters increases the performance of Amharic-English machine translation in both directions (a rough sketch of such normalization follows below).
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
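The normalization referred to above collapses Amharic characters that are pronounced identically into a single canonical form. A minimal, hypothetical normalizer might look like the sketch below; the mapping covers only a few example homophone families, not the paper's full table.

```python
# Illustrative (partial) normalizer for Amharic homophone characters.
# Several distinct Ge'ez characters share one pronunciation, so mapping
# them to a canonical form reduces sparsity for MT. This mapping is a
# small illustrative subset, not the paper's exact normalization table.
HOMOPHONE_MAP = {
    "ሐ": "ሀ", "ኀ": "ሀ",  # "ha" family
    "ሠ": "ሰ",             # "se" family
    "ዐ": "አ",             # "a" family
    "ፀ": "ጸ",             # "tse" family
}

def normalize_amharic(text: str) -> str:
    """Replace homophone character variants with one canonical form."""
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)
```

A complete normalizer would also map every vowel order of each character series, not just the base forms shown here.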
- The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT) [2.9681323891560303]
The University of Edinburgh participated in the WMT22 shared task on code-mixed translation.
This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences, and ii) machine translation from Hinglish to English.
Our systems for both subtasks were among the overall top-performing submissions.
arXiv Detail & Related papers (2022-10-20T14:40:10Z)
- BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers [1.181206257787103]
This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system.
For the HinglishEval task, the proposed model uses multilingual BERT to find the similarity between synthetically generated and human-generated sentences (a rough sketch of such a similarity check follows below).
arXiv Detail & Related papers (2022-06-17T10:36:50Z)
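As a hedged sketch of how multilingual BERT might be used to score the similarity between a synthetic and a human-generated Hinglish sentence, the snippet below mean-pools mBERT token embeddings and compares them with cosine similarity. The pooling strategy and example sentences are assumptions, not the submission's confirmed setup.

```python
# Sketch: cosine similarity between mBERT sentence embeddings for a
# synthetic vs. a human-written Hinglish sentence. Mean pooling is an
# assumed design choice, not the paper's confirmed method.
import torch
from transformers import AutoModel, AutoTokenizer

NAME = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME)

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # mean-pooled vector

synthetic = "mujhe yeh movie bahut pasand aayi"  # hypothetical system output
human = "mujhe yeh film bahut achhi lagi"        # hypothetical reference
print(float(torch.cosine_similarity(embed(synthetic), embed(human), dim=0)))
```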
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task (a toy sketch follows this entry) to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
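As a toy illustration of what a code-switching "restore" training pair could look like, the sketch below corrupts an English sentence by swapping some words for romanized Hindi translations from a tiny bilingual lexicon and keeps the original sentence as the restore target. The lexicon and swap probability are invented for illustration and are not the paper's actual data pipeline.

```python
# Toy sketch of building a code-switching "restore" pair: noise an English
# sentence with lexicon translations; the model learns to restore the
# original. Lexicon and swap rate are invented for illustration only.
import random

EN_HI_LEXICON = {"this": "yeh", "is": "hai", "good": "achhi"}

def make_restore_pair(sentence: str, swap_prob: float = 0.5):
    noised = [
        EN_HI_LEXICON.get(word, word) if random.random() < swap_prob else word
        for word in sentence.split()
    ]
    return " ".join(noised), sentence  # (code-switched input, restore target)

src, tgt = make_restore_pair("this movie is very good")
print(src, "->", tgt)  # e.g. "yeh movie hai very good -> this movie is very good"
```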
- CALCS 2021 Shared Task: Machine Translation for Code-Switched Data [27.28423961505655]
We address machine translation for code-switched social media data and create a community shared task.
For the supervised setting, participants are challenged to translate English into Hindi-English (Eng-Hinglish) in a single direction.
For the unsupervised setting, we provide the following language pairs: English and Spanish-English (Eng-Spanglish), and English and Modern Standard Arabic-Egyptian Arabic (Eng-MSAEA) in both directions.
arXiv Detail & Related papers (2022-02-19T15:39:34Z)
- Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages [12.30099599834466]
Prabhupadavani is a multilingual code-mixed speech translation (ST) dataset for 25 languages.
It contains 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language.
This data can also be used for a code-mixed machine translation task.
arXiv Detail & Related papers (2022-01-27T09:24:36Z)
- Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation [6.021269454707625]
We investigate machine translation of code-mixed Modern Standard Arabic and Egyptian Arabic (MSAEA) into English.
We develop models under different conditions, employing both (i) standard end-to-end sequence-to-sequence (S2S) Transformers trained from scratch and (ii) pre-trained S2S language models (LMs).
We are able to acquire reasonable performance using only MSA-EN parallel data with S2S models trained from scratch and LMs fine-tuned on data from various Arabic dialects.
arXiv Detail & Related papers (2021-05-28T03:38:35Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won first place in the English-to-Chinese, Polish-to-English, and German-to-Upper Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.