The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)
- URL: http://arxiv.org/abs/2210.11309v1
- Date: Thu, 20 Oct 2022 14:40:10 GMT
- Title: The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)
- Authors: Faheem Kirefu, Vivek Iyer, Pinzhen Chen and Laurie Burchell
- Abstract summary: The University of Edinburgh participated in the WMT22 shared task on code-mixed translation.
This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences and ii) machine translation from Hinglish to English.
Our systems for both subtasks were among the overall top-performing submissions.
- Score: 2.9681323891560303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The University of Edinburgh participated in the WMT22 shared task on
code-mixed translation. This consists of two subtasks: i) generating code-mixed
Hindi/English (Hinglish) text from parallel Hindi and English sentences and
ii) machine translation from Hinglish to English. As both
subtasks are considered low-resource, we focused our efforts on careful data
generation and curation, especially the use of backtranslation from monolingual
resources. For subtask 1 we explored the effects of constrained decoding on
English and transliterated subwords in order to produce Hinglish. For subtask
2, we investigated different pretraining techniques, namely comparing simple
initialisation from existing machine translation models and aligned
augmentation. For both subtasks, we found that our baseline systems worked
best. Our systems for both subtasks were among the overall top-performing
submissions.
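As a rough sketch of the constrained decoding used for subtask 1, the toy decoder below masks next-token scores so that only an allowed set of subwords (English plus romanised Hindi) can ever be emitted; the vocabulary and scoring function are placeholders, not the submission's actual model or subword inventory.

```python
import random

VOCAB = ["the", "market", "bazaar", "chalo", "ja", "raha", "hoon", "</s>"]
# Allowed Hinglish outputs: English subwords plus transliterated Hindi ones.
# "market" is deliberately excluded to show the mask in action.
ALLOWED = {"the", "bazaar", "chalo", "ja", "raha", "hoon", "</s>"}

def toy_scores(prefix):
    """Stand-in for a real decoder: deterministic pseudo-random scores."""
    rng = random.Random(" ".join(prefix))
    return {tok: rng.uniform(0.0, 1.0) for tok in VOCAB}

def constrained_greedy_decode(max_len=10):
    prefix = []
    for _ in range(max_len):
        scores = toy_scores(prefix)
        # Mask: drop every token outside the allowed set before the argmax.
        allowed = {t: s for t, s in scores.items() if t in ALLOWED}
        next_tok = max(allowed, key=allowed.get)
        if next_tok == "</s>":
            break
        prefix.append(next_tok)
    return " ".join(prefix)

print(constrained_greedy_decode())
```

With a real model, the same masking is typically applied through a decoding hook, e.g. the `prefix_allowed_tokens_fn` argument of Hugging Face's `generate()`.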
Related papers
- SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection [68.858931667807]
Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine.
Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM.
Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine.
arXiv Detail & Related papers (2024-04-22T13:56:07Z)
- Findings of the WMT 2022 Shared Task on Translation Suggestion [63.457874930232926]
We report the result of the first edition of the WMT shared task on Translation Suggestion.
The task aims to provide alternatives for specific words or phrases given the entire document generated by machine translation (MT).
It consists of two sub-tasks, namely naive translation suggestion and translation suggestion with hints.
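A hypothetical illustration of one task instance (the field names below are invented for clarity, not the official data schema):

```python
# One translation-suggestion instance: the system must propose alternatives
# for a marked span in MT output, given the surrounding document context.
example = {
    "source": "Wir berichten das Ergebnis der ersten Ausgabe.",
    "mt_output": "We report the result of the first edition.",
    "span": (14, 20),                         # "result" in mt_output
    "suggestions": ["results", "findings"],   # human-preferred alternatives
}
assert example["mt_output"][14:20] == "result"
```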
arXiv Detail & Related papers (2022-11-30T03:48:36Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
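A minimal sketch of homophone normalization of this kind, assuming a small hand-picked mapping (an illustrative subset, not the paper's exact table):

```python
# Map Amharic characters that sound alike to a single canonical form.
HOMOPHONES = {
    "ሐ": "ሀ", "ኀ": "ሀ",  # 'ha' variants
    "ሠ": "ሰ",            # 'se' variant
    "ዐ": "አ",            # 'a' variant
    "ፀ": "ጸ",            # 'tse' variant
}
TABLE = str.maketrans(HOMOPHONES)

def normalize_amharic(text: str) -> str:
    return text.translate(TABLE)

print(normalize_amharic("ሠላም"))  # -> "ሰላም"
```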
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
- Gui at MixMT 2022: English-Hinglish: An MT approach for translation of code mixed data [13.187116325089951]
We tackle code-mixed translation in both directions: English + Hindi to Hinglish and Hinglish to English.
To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation.
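For reference, WER as typically computed is the word-level edit distance divided by the reference length; a minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("main bazaar ja raha hoon", "main market ja raha hoon"))  # 0.2
```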
arXiv Detail & Related papers (2022-10-21T19:48:18Z)
- Synergy with Translation Artifacts for Training and Inference in Multilingual Tasks [11.871523410051527]
This paper shows that using both translations simultaneously can be synergistic on various multilingual sentence classification tasks.
We propose a cross-lingual fine-tuning algorithm called MUSC, which uses SupCon and MixUp jointly to improve performance.
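A generic sketch of the MixUp half of such a recipe, applied to sentence embeddings and one-hot labels (MUSC's exact formulation, including the SupCon loss, is described in the paper):

```python
import numpy as np

def mixup(x, y, alpha=0.2, seed=0):
    """Interpolate a batch of embeddings x and one-hot labels y."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)        # mixing coefficient
    perm = rng.permutation(len(x))      # random pairing within the batch
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

x = np.random.default_rng(1).normal(size=(4, 8))  # 4 sentence embeddings
y = np.eye(2)[[0, 1, 0, 1]]                       # one-hot class labels
x_mix, y_mix = mixup(x, y)
print(y_mix)  # soft labels reflecting the interpolation weight
```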
arXiv Detail & Related papers (2022-10-18T04:55:24Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
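A minimal sketch of how such a code-switching restore example might be constructed, assuming a hypothetical bilingual lexicon: some source words are swapped for translations, and the original sentence becomes the restore target.

```python
import random

LEXICON = {"market": "bazaar", "going": "ja", "am": "hoon"}  # hypothetical

def make_restore_example(sentence: str, p: float = 0.3, seed: int = 0):
    """Return (code-switched input, original restore target)."""
    rng = random.Random(seed)
    corrupted = [
        LEXICON[w] if w in LEXICON and rng.random() < p else w
        for w in sentence.split()
    ]
    return " ".join(corrupted), sentence

src, tgt = make_restore_example("i am going to the market", p=1.0)
print(src)  # i hoon ja to the bazaar
print(tgt)  # i am going to the market
```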
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation [6.021269454707625]
We investigate machine translation of code-mixed Modern Standard Arabic and Egyptian Arabic (MSAEA) into English.
We develop models under different conditions, employing both (i) standard end-to-end sequence-to-sequence (S2S) Transformers trained from scratch and (ii) pre-trained S2S language models (LMs).
We are able to acquire reasonable performance using only MSA-EN parallel data with S2S models trained from scratch and LMs fine-tuned on data from various Arabic dialects.
arXiv Detail & Related papers (2021-05-28T03:38:35Z)
- Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
- Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings [51.47607125262885]
We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text.
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods.
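A minimal sketch of the nearest-neighbour mining step, with random vectors standing in for the multilingual BERT sentence embeddings used in the paper (which additionally refines the model by self-training):

```python
import numpy as np

rng = np.random.default_rng(0)
src_emb = rng.normal(size=(5, 16))  # placeholder source embeddings
tgt_emb = rng.normal(size=(7, 16))  # placeholder target embeddings

def mine_pairs(src, tgt, threshold=0.3):
    """Pair each source sentence with its nearest target by cosine similarity."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                 # full cosine-similarity matrix
    nn = sims.argmax(axis=1)           # nearest target for each source
    return [(i, int(nn[i]), float(sims[i, nn[i]]))
            for i in range(len(src)) if sims[i, nn[i]] >= threshold]

for i, j, score in mine_pairs(src_emb, tgt_emb):
    print(f"source {i} <-> target {j} (cos={score:.2f})")
```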
arXiv Detail & Related papers (2020-10-15T14:04:03Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task [111.91077204077817]
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems won the first place on English to Chinese, Polish to English, and German to Upper Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)