Investigating Code-Mixed Modern Standard Arabic-Egyptian to English
Machine Translation
- URL: http://arxiv.org/abs/2105.13573v1
- Date: Fri, 28 May 2021 03:38:35 GMT
- Title: Investigating Code-Mixed Modern Standard Arabic-Egyptian to English
Machine Translation
- Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed
- Abstract summary: We investigate machine translation from code-mixed Modern Standard Arabic and Egyptian Arabic (MSAEA) into English.
We develop models under different conditions, employing both (i) standard end-to-end sequence-to-sequence (S2S) Transformers trained from scratch and (ii) pre-trained S2S language models (LMs).
We acquire reasonable performance using only MSA-EN parallel data with S2S models trained from scratch, and find that LMs fine-tuned on data from various Arabic dialects help the MSAEA-EN task.
- Score: 6.021269454707625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in neural machine translation (NMT) has made it possible to
translate successfully between monolingual language pairs where large parallel
data exist, with pre-trained models improving performance even further.
Although there exists work on translating in code-mixed settings (where one of
the pairs includes text from two or more languages), it is still unclear what
recent success in NMT and language modeling exactly means for translating
code-mixed text. We investigate one such context, namely MT from code-mixed
Modern Standard Arabic and Egyptian Arabic (MSAEA) into English. We develop
models under different conditions, employing both (i) standard end-to-end
sequence-to-sequence (S2S) Transformers trained from scratch and (ii)
pre-trained S2S language models (LMs). We are able to acquire reasonable
performance using only MSA-EN parallel data with S2S models trained from
scratch. We also find LMs fine-tuned on data from various Arabic dialects to
help the MSAEA-EN task. Our work is in the context of the Shared Task on
Machine Translation in Code-Switching. Our best model achieves 25.72 BLEU,
placing us first on the official shared task evaluation for MSAEA-EN.
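As a rough illustration of setting (ii), the sketch below fine-tunes an off-the-shelf pre-trained multilingual S2S checkpoint on a toy Arabic-English pair and then translates an Egyptian-Arabic-style sentence into English. The mBART-50 checkpoint, the toy sentences, and the single-step training loop are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch (assumed setup, not the paper's exact recipe): fine-tune a
# pre-trained multilingual S2S model on Arabic-English pairs, then translate.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"  # stand-in for a pre-trained S2S LM
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="ar_AR", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Toy parallel pair (Arabic source, English target); real training would use full MSA-EN bitext.
pairs = [("كيف حالك اليوم؟", "How are you today?")]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for src, tgt in pairs:
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss  # cross-entropy over the English target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference on an Egyptian-Arabic-style sentence, forcing English as the output language.
model.eval()
inputs = tokenizer("النهارده الجو حلو في القاهرة", return_tensors="pt")
output = model.generate(**inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"))
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```

In the paper's setting, the same fine-tuning recipe would be applied to MSA-EN bitext (and dialectal Arabic data) and evaluated on code-mixed MSAEA input.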
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
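The two objectives themselves are not spelled out in this summary, so the snippet below only sketches the generic pattern of weighting a translation loss against an auxiliary adaptation loss; the function name and the lambda weighting are placeholders, not ZeroMMT's actual formulation.

```python
# Generic "mixture of two objectives" pattern; the two losses are placeholders.
import torch

def mixed_objective(loss_mt: torch.Tensor, loss_aux: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Weighted combination of a translation loss and an auxiliary adaptation loss."""
    return (1.0 - lam) * loss_mt + lam * loss_aux

total = mixed_objective(torch.tensor(2.3), torch.tensor(0.7), lam=0.25)
print(total)  # tensor(1.9000)
```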
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2M short stories for 3-4-year-old children, from English to Arabic using the open NLLB-3B MT model.
To rectify issues arising from the translated data, we pre-train the models with a small dataset of synthesized high-quality Arabic stories.
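As a hedged sketch of the translation step described here, the snippet below runs an English sentence through an open NLLB checkpoint via Hugging Face transformers; the distilled 600M model is used as a lightweight stand-in for the 3B model named above, and the sample sentence is invented.

```python
# Illustrative English-to-Arabic translation with an open NLLB checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "facebook/nllb-200-distilled-600M"  # smaller stand-in for the 3B NLLB model
tok = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(name)

text = "Once upon a time, a little fox found a shiny red ball."
enc = tok(text, return_tensors="pt")
gen = model.generate(
    **enc,
    forced_bos_token_id=tok.convert_tokens_to_ids("arb_Arab"),  # target: Modern Standard Arabic
    max_new_tokens=64,
)
print(tok.batch_decode(gen, skip_special_tokens=True)[0])
```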
arXiv Detail & Related papers (2024-05-23T07:53:04Z)
- The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way to train a single machine translation model that can translate monolingual sentences from one language to another.
This model can be considered a bilingual model in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation [23.401781865904386]
We propose a two-stage approach for training a single NMT model to translate unseen languages both to and from English.
For the first stage, we initialize an encoder-decoder model with pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning on parallel data in 40 languages to English.
For the second stage, we leverage this generalization ability to generate synthetic parallel data from monolingual datasets, then bidirectionally train with successive rounds of back-translation.
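A minimal sketch of one back-translation round under assumed conditions: a generic multilingual checkpoint (mBART-50 here, not necessarily the paper's model) translates monolingual English text into a synthetic source language, and the resulting pairs would then train the reverse direction. The English-French pair is illustrative only.

```python
# One back-translation round: generate synthetic sources from monolingual targets.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

name = "facebook/mbart-large-50-many-to-many-mmt"  # illustrative checkpoint
tok = MBart50TokenizerFast.from_pretrained(name)
model = MBartForConditionalGeneration.from_pretrained(name)

def translate(texts, src, tgt):
    """Translate a list of sentences from language code `src` to `tgt`."""
    tok.src_lang = src
    enc = tok(texts, return_tensors="pt", padding=True)
    gen = model.generate(**enc, forced_bos_token_id=tok.convert_tokens_to_ids(tgt))
    return tok.batch_decode(gen, skip_special_tokens=True)

mono_en = ["The library opens at nine in the morning."]   # monolingual English text
synthetic_fr = translate(mono_en, "en_XX", "fr_XX")        # back-translate EN -> FR
bitext = list(zip(synthetic_fr, mono_en))                  # (synthetic source, real target)
print(bitext)  # this synthetic bitext would then fine-tune the FR -> EN direction
```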
arXiv Detail & Related papers (2022-09-06T21:20:41Z)
- Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task [95.06453182273027]
This report describes Microsoft's machine translation systems for the WMT21 shared task on large-scale multilingual machine translation.
Our model submissions to the shared task were built with DeltaLM (https://aka.ms/deltalm), a generic pre-trained multilingual encoder-decoder model.
Our final submissions ranked first on three tracks in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2021-11-03T09:16:17Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
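A toy sketch of the masking objective described above, in plain Python rather than any particular framework: input tokens are randomly replaced with a mask symbol and the full original sentence serves as the decoder's reconstruction target. The mask rate and example sentence are my own illustrative choices.

```python
# Build a single denoising training pair: corrupted input -> original sentence.
import random

MASK = "<mask>"

def make_denoising_pair(sentence, mask_prob=0.35, seed=0):
    """Return (corrupted_input, target) where the target is the full original sentence."""
    rng = random.Random(seed)
    tokens = sentence.split()
    corrupted = [MASK if rng.random() < mask_prob else t for t in tokens]
    return " ".join(corrupted), sentence

src, tgt = make_denoising_pair("the quick brown fox jumps over the lazy dog")
print(src)  # e.g. "the <mask> brown fox <mask> over the lazy dog"
print(tgt)  # the reconstruction target fed to the decoder
```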
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
- Zero-shot Cross-lingual Transfer of Neural Machine Translation with Multilingual Pretrained Encoders [74.89326277221072]
How to improve the cross-lingual transfer of NMT models with a multilingual pretrained encoder remains under-explored.
We propose SixT, a simple yet effective model for this task.
Our model achieves better performance on many-to-English test sets than CRISS and m2m-100.
arXiv Detail & Related papers (2021-04-18T07:42:45Z)
- Lite Training Strategies for Portuguese-English and English-Portuguese Translation [67.4894325619275]
We investigate the use of pre-trained models, such as T5, for Portuguese-English and English-Portuguese translation tasks.
We propose an adaptation of the English tokenizer to represent Portuguese characters, such as those with diaeresis, acute, and grave accents.
Our results show that our models achieve performance competitive with state-of-the-art models while being trained on modest hardware.
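A hedged sketch of the tokenizer-adaptation idea: add Portuguese accented characters missing from an English T5 vocabulary and resize the embedding matrix to match. The t5-small checkpoint and the character list are illustrative assumptions, not the authors' exact procedure.

```python
# Extend an English T5 tokenizer with Portuguese accented characters.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

portuguese_chars = ["á", "à", "â", "ã", "é", "ê", "í", "ó", "ô", "õ", "ú", "ü", "ç"]
new_tokens = [c for c in portuguese_chars if c not in tokenizer.get_vocab()]

tokenizer.add_tokens(new_tokens)               # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match
print(f"added {len(new_tokens)} character tokens")
```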
arXiv Detail & Related papers (2020-08-20T04:31:03Z)