Exploring Text-to-Text Transformers for English to Hinglish Machine
Translation with Synthetic Code-Mixing
- URL: http://arxiv.org/abs/2105.08807v1
- Date: Tue, 18 May 2021 19:50:25 GMT
- Title: Exploring Text-to-Text Transformers for English to Hinglish Machine
Translation with Synthetic Code-Mixing
- Authors: Ganesh Jawahar, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, Laks
V.S. Lakshmanan
- Abstract summary: We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English).
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
- Score: 19.19256927651015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe models aimed at the understudied problem of translating between
monolingual and code-mixed language pairs. More specifically, we offer a wide
range of models that convert monolingual English text into Hinglish (code-mixed
Hindi and English). Given the recent success of pretrained language models, we
also test the utility of two recent Transformer-based encoder-decoder models
(i.e., mT5 and mBART) on the task, finding that both work well. Given the paucity
of training data for code-mixing, we also propose a dependency-free method for
generating code-mixed texts from bilingual distributed representations that we
exploit for improving language model performance. In particular, armed with
this additional data, we adopt a curriculum learning approach where we first
finetune the language models on synthetic data then on gold code-mixed data. We
find that, although simple, our synthetic code-mixing method is competitive
with (and in some cases even superior to) several standard methods
(backtranslation and a method based on equivalence constraint theory) under a
diverse set of conditions. Our work shows that the mT5 model, finetuned
following the curriculum learning procedure, achieves the best translation
performance (12.67 BLEU). Our models place first in the overall ranking of the
English-Hinglish official shared task.
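To make the synthetic code-mixing idea concrete, the sketch below (not the authors' released code) replaces a random subset of English words with their nearest Hindi neighbours in a bilingual embedding space that is assumed to be already aligned (e.g., with a tool such as MUSE or VecMap); the toy vocabularies, the `switch_prob` knob, and the random vectors are all illustrative assumptions.

```python
# Minimal sketch (an assumption, not the authors' implementation) of
# dependency-free synthetic code-mixing: swap some English words for their
# nearest Hindi neighbours in a shared bilingual embedding space.
import random
import numpy as np

def normalize(mat):
    # Row-normalize so dot products become cosine similarities.
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def make_code_mixed(sentence, en_vocab, en_emb, hi_vocab, hi_emb, switch_prob=0.3):
    """Replace each English word found in the bilingual space with its nearest
    Hindi neighbour with probability `switch_prob` (a hypothetical knob)."""
    en_index = {w: i for i, w in enumerate(en_vocab)}
    en_emb, hi_emb = normalize(en_emb), normalize(hi_emb)
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in en_index and random.random() < switch_prob:
            sims = hi_emb @ en_emb[en_index[key]]       # similarity to every Hindi word
            out.append(hi_vocab[int(np.argmax(sims))])  # pick the closest Hindi word
        else:
            out.append(word)
    return " ".join(out)

# Toy usage: random vectors stand in for embeddings loaded from aligned
# English/Hindi embedding files (hypothetical inputs).
rng = np.random.default_rng(0)
en_vocab = ["i", "love", "watching", "movies"]
hi_vocab = ["mujhe", "pasand", "dekhna", "filmein"]
shared = rng.normal(size=(4, 50))   # pretend each English/Hindi pair is already aligned
print(make_code_mixed("I love watching movies", en_vocab, shared, hi_vocab, shared, switch_prob=0.5))
```

Under the curriculum described in the abstract, text produced this way would be used for a first finetuning pass of mT5 or mBART, followed by a second pass on the gold code-mixed data.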
Related papers
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
- The Effect of Alignment Objectives on Code-Switching Translation [0.0]
We propose a way of training a single machine translation model that can translate monolingual sentences from one language to another.
This model can be considered bilingual in the human sense.
arXiv Detail & Related papers (2023-09-10T14:46:31Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- Comparative Study of Pre-Trained BERT Models for Code-Mixed Hindi-English Data [0.7874708385247353]
"Code Mixed" refers to the use of more than one language in the same text.
In this work, we focus on low-resource Hindi-English code-mixed language.
We report state-of-the-art results on respective datasets using HingBERT-based models.
arXiv Detail & Related papers (2023-05-25T05:10:28Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Crosslingual Generalization through Multitask Finetuning [80.8822603322471]
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting.
We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0.
We find that finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages.
arXiv Detail & Related papers (2022-11-03T13:19:32Z)
- Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation [6.021269454707625]
We investigate translating code-mixed Modern Standard Arabic and Egyptian Arabic (MSAEA) into English.
We develop models under different conditions, employing both (i) standard end-to-end sequence-to-sequence (S2S) Transformers trained from scratch and (ii) pre-trained S2S language models (LMs).
We achieve reasonable performance using only MSA-EN parallel data with S2S models trained from scratch, and with LMs fine-tuned on data from various Arabic dialects.
arXiv Detail & Related papers (2021-05-28T03:38:35Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence (a minimal sketch of this dual-objective setup appears after this list).
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems from WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource scenarios, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
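The "Learning Contextualised Cross-lingual Word Embeddings" entry above describes an LSTM encoder-decoder that simultaneously translates and reconstructs its input. The sketch below is one hedged reading of that setup, with a single encoder feeding a translation decoder and a reconstruction decoder whose losses are summed; the layer sizes, vocabulary sizes, and single-layer configuration are illustrative assumptions, not details taken from the paper.

```python
# Minimal PyTorch sketch (an assumption, not the paper's code) of an LSTM
# encoder-decoder trained to both translate and reconstruct its input, so the
# encoder states can be read off as contextualised word embeddings.
import torch
import torch.nn as nn

class DualObjectiveSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.trans_decoder = nn.LSTM(dim, dim, batch_first=True)  # translation head
        self.recon_decoder = nn.LSTM(dim, dim, batch_first=True)  # reconstruction head
        self.trans_out = nn.Linear(dim, tgt_vocab)
        self.recon_out = nn.Linear(dim, src_vocab)

    def forward(self, src_ids, tgt_in_ids, src_in_ids):
        # Encode the source sentence; its hidden states double as
        # contextualised word representations.
        enc_states, (h, c) = self.encoder(self.src_embed(src_ids))
        # Teacher-forced decoding for both objectives, initialised from the
        # same encoder state.
        trans_states, _ = self.trans_decoder(self.tgt_embed(tgt_in_ids), (h, c))
        recon_states, _ = self.recon_decoder(self.src_embed(src_in_ids), (h, c))
        return self.trans_out(trans_states), self.recon_out(recon_states), enc_states

# Toy training step with random token ids standing in for a parallel corpus.
model = DualObjectiveSeq2Seq(src_vocab=1000, tgt_vocab=1200)
loss_fn = nn.CrossEntropyLoss()
src = torch.randint(0, 1000, (4, 12))   # batch of source sentences
tgt = torch.randint(0, 1200, (4, 14))   # batch of target sentences
trans_logits, recon_logits, _ = model(src, tgt[:, :-1], src[:, :-1])
loss = (loss_fn(trans_logits.reshape(-1, 1200), tgt[:, 1:].reshape(-1))
        + loss_fn(recon_logits.reshape(-1, 1000), src[:, 1:].reshape(-1)))
loss.backward()
```

In such a setup, the encoder's hidden states (enc_states above) are what would be used as the contextualised cross-lingual word embeddings.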
This list is automatically generated from the titles and abstracts of the papers on this site.