Domain Curricula for Code-Switched MT at MixMT 2022
- URL: http://arxiv.org/abs/2210.17463v1
- Date: Mon, 31 Oct 2022 16:41:57 GMT
- Title: Domain Curricula for Code-Switched MT at MixMT 2022
- Authors: Lekan Raheem and Maab Elrashid
- Abstract summary: We present our approach and results for the Code-mixed Machine Translation (MixMT) shared task at WMT 2022.
The task consists of two subtasks, monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2).
We jointly learn multiple domains of text by pretraining and fine-tuning, combined with a sentence alignment objective.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In multilingual colloquial settings, speakers habitually compose text or
speech containing tokens or phrases from different languages, a phenomenon
popularly known as code-switching or code-mixing (CMX).
We present our approach and results for the Code-mixed Machine Translation
(MixMT) shared task at WMT 2022: the task consists of two subtasks, monolingual
to code-mixed machine translation (Subtask-1) and code-mixed to monolingual
machine translation (Subtask-2). Most non-synthetic code-mixed data come from
social media, but gathering a significant amount of such data is laborious, and
this form of data shows more writing variation than other domains; for both
subtasks, we therefore experimented with data schedules for out-of-domain data.
We jointly learn multiple domains of text by pretraining and fine-tuning,
combined with a sentence alignment objective. We found that switching between
domains improved performance on the domains seen earliest during training but
degraded performance on the remaining domains. A continuous training run with
strategically dispensed data from different domains showed significantly
improved performance over fine-tuning.
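As a minimal, illustrative sketch of the kind of domain-interleaved data schedule the abstract describes (continuously mixing out-of-domain and code-mixed text rather than fine-tuning on one domain at a time), the snippet below samples batches with domain weights that shift over training. The domain names, mixing weights, and scheduling function are assumptions for illustration, not the authors' released configuration.

```python
import random
from typing import Dict, Iterator, List, Tuple

def curriculum_batches(
    pools: Dict[str, List[Tuple[str, str]]],   # domain -> (source, target) sentence pairs
    start_weights: Dict[str, float],           # domain mix at the start of training
    end_weights: Dict[str, float],             # domain mix at the end of training
    total_steps: int,
    batch_size: int = 32,
    seed: int = 0,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches whose domain composition is linearly interpolated from
    start_weights to end_weights, so out-of-domain text can dominate early
    steps and code-mixed social-media text can dominate later ones."""
    rng = random.Random(seed)
    domains = sorted(pools)
    for step in range(total_steps):
        t = step / max(total_steps - 1, 1)
        weights = [(1 - t) * start_weights[d] + t * end_weights[d] for d in domains]
        batch = []
        for _ in range(batch_size):
            domain = rng.choices(domains, weights=weights, k=1)[0]
            batch.append(rng.choice(pools[domain]))
        yield batch

# Toy usage; real pools would hold full parallel corpora.
pools = {
    "news": [("a news sentence", "ek news vakya")],
    "social": [("lol that movie was awesome", "lol wo movie mast thi")],
}
schedule = curriculum_batches(
    pools,
    start_weights={"news": 0.9, "social": 0.1},
    end_weights={"news": 0.2, "social": 0.8},
    total_steps=3,
    batch_size=2,
)
for batch in schedule:
    print(batch)
```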
Related papers
- A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation [52.0964459842176]
Current state-of-the-art dialogue systems heavily rely on extensive training datasets.
We propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD^2G.
The AMD^2G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training.
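A minimal sketch of the two-stage recipe that summary describes; the model, datasets, and training-step callable below are hypothetical placeholders, not the paper's actual components.

```python
from typing import Any, Callable, Iterable

def two_stage_training(
    model: Any,
    domain_agnostic_data: Iterable[Any],   # pooled, augmented dialogues from many domains
    target_domain_data: Iterable[Any],     # dialogues for the low-resource target domain
    train_one_epoch: Callable[[Any, Iterable[Any]], None],
    agnostic_epochs: int = 3,
    adaptation_epochs: int = 2,
) -> Any:
    """Stage 1: domain-agnostic training on pooled data.
    Stage 2: domain adaptation training on target-domain data."""
    for _ in range(agnostic_epochs):
        train_one_epoch(model, domain_agnostic_data)
    for _ in range(adaptation_epochs):
        train_one_epoch(model, target_domain_data)
    return model
```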
arXiv Detail & Related papers (2024-06-14T09:52:27Z)
- CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval [5.97515243922116]
We present the Charles University system for the MRL2023 Shared Task on Multi-lingual Multi-task Information Retrieval.
The goal of the shared task was to develop systems for named entity recognition and question answering in several under-represented languages.
Our solutions to both subtasks rely on the translate-test approach.
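A rough sketch of the translate-test idea mentioned above: translate the input into a high-resource language, run an existing model there, and map the prediction back. The `translate` and `english_qa_model` callables are hypothetical stand-ins, not the CUNI system's actual components.

```python
from typing import Callable, Dict

def translate_test_qa(
    question: str,
    context: str,
    src_lang: str,
    translate: Callable[[str, str, str], str],    # (text, source lang, target lang) -> text
    english_qa_model: Callable[[str, str], str],  # (question, context) -> answer
) -> Dict[str, str]:
    """Answer a question posed in an under-represented language by translating
    into English, answering there, and translating the answer back."""
    q_en = translate(question, src_lang, "en")
    c_en = translate(context, src_lang, "en")
    answer_en = english_qa_model(q_en, c_en)
    return {
        "answer_en": answer_en,
        "answer_src": translate(answer_en, "en", src_lang),
    }
```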
arXiv Detail & Related papers (2023-10-25T10:22:49Z)
- A General-Purpose Multilingual Document Encoder [9.868221447090855]
We pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE).
We leverage Wikipedia as a readily available source of comparable documents for creating training data.
We evaluate the effectiveness of HMDE on two of the most common and prominent cross-lingual document-level tasks.
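As an illustration only, a compact PyTorch sketch of a hierarchical document encoder in the spirit of that summary: a sentence-level transformer produces sentence vectors, and a document-level transformer contextualises them. The layer sizes and mean-pooling are generic assumptions, not the HMDE architecture.

```python
import torch
import torch.nn as nn

class HierarchicalDocumentEncoder(nn.Module):
    """Toy hierarchical encoder: embed tokens, encode and mean-pool each
    sentence, then run a transformer over the sequence of sentence vectors."""

    def __init__(self, vocab_size: int = 30000, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sentence_encoder = nn.TransformerEncoder(sent_layer, num_layers)
        doc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.document_encoder = nn.TransformerEncoder(doc_layer, num_layers)

    def forward(self, docs: torch.Tensor) -> torch.Tensor:
        # docs: (batch, n_sentences, n_tokens) of token ids
        b, s, t = docs.shape
        tokens = self.token_emb(docs.view(b * s, t))        # (b*s, t, d)
        sent_states = self.sentence_encoder(tokens)         # (b*s, t, d)
        sent_vecs = sent_states.mean(dim=1).view(b, s, -1)  # (b, s, d)
        doc_states = self.document_encoder(sent_vecs)       # (b, s, d)
        return doc_states.mean(dim=1)                       # (b, d) document vector

# Example: encode 2 documents, each with 3 sentences of 8 tokens.
enc = HierarchicalDocumentEncoder()
print(enc(torch.randint(0, 30000, (2, 3, 8))).shape)  # torch.Size([2, 256])
```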
arXiv Detail & Related papers (2023-05-11T17:55:45Z)
- Domain Mismatch Doesn't Always Prevent Cross-Lingual Transfer Learning [51.232774288403114]
Cross-lingual transfer learning has been surprisingly effective in zero-shot cross-lingual classification.
We show that a simple regimen can overcome much of the effect of domain mismatch in cross-lingual transfer.
arXiv Detail & Related papers (2022-11-30T01:24:33Z)
- Can Domains Be Transferred Across Languages in Multi-Domain Multilingual Neural Machine Translation? [52.27798071809941]
This paper investigates whether the domain information can be transferred across languages on the composition of multi-domain and multilingual NMT.
We find that multi-domain multilingual (MDML) NMT can boost zero-shot translation performance by up to +10 BLEU.
arXiv Detail & Related papers (2022-10-20T23:13:54Z)
- Extreme Multi-Domain, Multi-Task Learning With Unified Text-to-Text Transfer Transformers [0.0]
We investigated the behavior of multi-domain, multi-task learning using multi-domain text-to-text transfer transformers (MD-T5).
We carried out experiments using three popular training strategies: BERT-style joint pretraining + successive finetuning, GPT-style joint pretraining + successive finetuning, and GPT-style joint pretraining + joint finetuning.
We show that while negative knowledge transfer and catastrophic forgetting are still considerable challenges for all the models, the GPT-style joint pretraining + joint finetuning strategy showed the most promise in multi-domain, multi-task learning.
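A schematic contrast between the successive and joint fine-tuning schedules compared in that work; the task pools and training-step callable are placeholders rather than the MD-T5 setup.

```python
import random
from typing import Any, Callable, Dict, Iterable

def successive_finetuning(model: Any, tasks: Dict[str, Iterable[Any]],
                          train_step: Callable[[Any, Any], None],
                          epochs_per_task: int = 1) -> Any:
    """Fine-tune on one task/domain at a time; later tasks can overwrite
    earlier ones (catastrophic forgetting)."""
    for _task, data in tasks.items():
        for _ in range(epochs_per_task):
            for example in data:
                train_step(model, example)
    return model

def joint_finetuning(model: Any, tasks: Dict[str, Iterable[Any]],
                     train_step: Callable[[Any, Any], None],
                     steps: int = 1000, seed: int = 0) -> Any:
    """Fine-tune on a shuffled mixture of all tasks/domains at once, which
    the summary reports as the most promising strategy."""
    rng = random.Random(seed)
    pools = {name: list(data) for name, data in tasks.items()}
    names = sorted(pools)
    for _ in range(steps):
        name = rng.choice(names)
        train_step(model, rng.choice(pools[name]))
    return model
```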
arXiv Detail & Related papers (2022-09-21T04:21:27Z)
- Domain-Specific Text Generation for Machine Translation [7.803471587734353]
We propose a novel approach to domain adaptation leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation.
We employ mixed fine-tuning to train models that significantly improve translation of in-domain texts.
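A small sketch of mixed fine-tuning as commonly described: continue training a general model on a mixture of the original generic data and oversampled in-domain (here, LM-generated) data. The oversampling ratio and helper names are assumptions, not the paper's configuration.

```python
import random
from typing import Any, List

def mixed_finetuning_corpus(
    generic_pairs: List[Any],
    synthetic_in_domain_pairs: List[Any],   # e.g. produced by an LM-based augmenter
    in_domain_oversample: int = 4,
    seed: int = 0,
) -> List[Any]:
    """Build a fine-tuning corpus that mixes the original generic data with
    oversampled in-domain data, so the model adapts without forgetting."""
    rng = random.Random(seed)
    corpus = list(generic_pairs) + synthetic_in_domain_pairs * in_domain_oversample
    rng.shuffle(corpus)
    return corpus
```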
arXiv Detail & Related papers (2022-08-11T16:22:16Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretraining and finetuning stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
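An illustrative construction of a code-switching "corrupt and restore" training pair: replace some words with dictionary translations and train the model to restore the original sentence. The lexicon and replacement rate are toy assumptions, not the paper's exact procedure.

```python
import random
from typing import Dict, Tuple

def make_code_switch_restore_pair(
    sentence: str,
    bilingual_lexicon: Dict[str, str],   # source word -> other-language word
    replace_prob: float = 0.3,
    seed: int = 0,
) -> Tuple[str, str]:
    """Return (code-switched input, original target) for a restore objective."""
    rng = random.Random(seed)
    switched = [
        bilingual_lexicon[tok]
        if tok in bilingual_lexicon and rng.random() < replace_prob else tok
        for tok in sentence.split()
    ]
    return " ".join(switched), sentence

# Toy example with a tiny English->German lexicon.
pair = make_code_switch_restore_pair(
    "the cat sat on the mat",
    {"cat": "Katze", "mat": "Matte"},
    replace_prob=1.0,
)
print(pair)  # ('the Katze sat on the Matte', 'the cat sat on the mat')
```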
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- Multilingual Domain Adaptation for NMT: Decoupling Language and Domain Information with Adapters [66.7986513246294]
We study the compositionality of language and domain adapters in the context of Machine Translation.
We find that in the partial resource scenario, a naive combination of domain-specific and language-specific adapters often results in 'catastrophic forgetting' of the missing languages.
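A compact PyTorch sketch of stacking a language adapter and a domain adapter on top of shared hidden states, in the spirit of the adapter composition studied there; the bottleneck sizes and residual wiring are generic assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, non-linearity, up-project,
    added back to the input as a residual."""
    def __init__(self, d_model: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))

class LanguageThenDomain(nn.Module):
    """Compose one language-specific and one domain-specific adapter,
    keyed by language/domain name, on top of shared hidden states."""
    def __init__(self, languages, domains, d_model: int = 512):
        super().__init__()
        self.lang_adapters = nn.ModuleDict({l: Adapter(d_model) for l in languages})
        self.domain_adapters = nn.ModuleDict({d: Adapter(d_model) for d in domains})

    def forward(self, hidden: torch.Tensor, lang: str, domain: str) -> torch.Tensor:
        return self.domain_adapters[domain](self.lang_adapters[lang](hidden))

# Example: adapt a (batch, seq, d_model) hidden state for German medical text.
stack = LanguageThenDomain(["de", "cs"], ["medical", "it"])
print(stack(torch.randn(2, 7, 512), lang="de", domain="medical").shape)  # [2, 7, 512]
```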
arXiv Detail & Related papers (2021-10-18T18:55:23Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smart phones.
We conduct quantitative experiments and in-depth analyses in this new setting, which benchmarks the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)