Alternative Input Signals Ease Transfer in Multilingual Machine Translation
- URL: http://arxiv.org/abs/2110.07804v1
- Date: Fri, 15 Oct 2021 01:56:46 GMT
- Title: Alternative Input Signals Ease Transfer in Multilingual Machine Translation
- Authors: Simeng Sun, Angela Fan, James Cross, Vishrav Chaudhary, Chau Tran, Philipp Koehn, Francisco Guzman
- Abstract summary: We tackle inhibited transfer by augmenting the training data with alternative signals that unify different writing systems.
We test these signals on Indic and Turkic languages, two language families where the writing systems differ but languages still share common features.
- Score: 21.088829932208945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work in multilingual machine translation (MMT) has focused on the
potential of positive transfer between languages, particularly cases where
higher-resourced languages can benefit lower-resourced ones. While training an
MMT model, the supervision signals learned from one language pair can be
transferred to the other via the tokens shared by multiple source languages.
However, the transfer is inhibited when the token overlap among source
languages is small, which manifests naturally when languages use different
writing systems. In this paper, we tackle inhibited transfer by augmenting the
training data with alternative signals that unify different writing systems,
such as phonetic, romanized, and transliterated input. We test these signals on
Indic and Turkic languages, two language families where the writing systems
differ but languages still share common features. Our results indicate that a
straightforward multi-source self-ensemble -- training a model on a mixture of
various signals and ensembling the outputs of the same model fed with different
signals during inference -- outperforms strong ensemble baselines by 1.3 BLEU
points on both language families. Further, we find that incorporating
alternative inputs via self-ensemble can be particularly effective when
the training set is small, leading to +5 BLEU when only 5% of the total training
data is accessible. Finally, our analysis demonstrates that including
alternative signals yields more consistency and translates named entities more
accurately, which is crucial for increased factuality of automated systems.
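To make the multi-source self-ensemble concrete, here is a minimal sketch that greedily decodes while averaging next-token log-probabilities from the same model fed with different renderings of the source sentence (original script, romanized, phonetic). The `next_token_log_probs` interface and the toy score tables are hypothetical stand-ins; the paper's actual Transformer models and combination details are not reproduced here.

```python
# Minimal sketch of multi-source self-ensembling at decoding time (assumed
# interface; the paper's actual models are Transformer NMT systems).
import math

VOCAB = ["<eos>", "namaste", "duniya", "hello", "world"]

def next_token_log_probs(model, signal, prefix):
    # Stand-in scorer: a real system would run one encoder-decoder forward
    # pass per input signal and return a distribution over the vocabulary.
    scores = model[signal].get(tuple(prefix), [1.0] * len(VOCAB))
    log_z = math.log(sum(math.exp(s) for s in scores))
    return [s - log_z for s in scores]

def self_ensemble_decode(model, signals, max_len=10):
    """Greedy decoding that averages log-probabilities over all signals."""
    prefix = []
    for _ in range(max_len):
        # Average the distributions produced by the *same* model fed with
        # each alternative rendering of the source sentence.
        avg = [0.0] * len(VOCAB)
        for sig in signals:
            lp = next_token_log_probs(model, sig, prefix)
            avg = [a + p / len(signals) for a, p in zip(avg, lp)]
        best = max(range(len(VOCAB)), key=lambda i: avg[i])
        if VOCAB[best] == "<eos>":
            break
        prefix.append(VOCAB[best])
    return " ".join(prefix)

if __name__ == "__main__":
    # Toy per-signal score tables keyed by the decoded prefix (hypothetical).
    toy_model = {
        "devanagari": {(): [0.1, 0.2, 0.1, 2.0, 0.1], ("hello",): [0.1, 0.1, 0.1, 0.1, 2.0]},
        "romanized":  {(): [0.1, 0.1, 0.1, 2.2, 0.1], ("hello",): [0.1, 0.1, 0.1, 0.1, 2.1]},
        "phonetic":   {(): [0.1, 0.3, 0.1, 1.8, 0.1], ("hello",): [0.1, 0.1, 0.2, 0.1, 1.9]},
    }
    print(self_ensemble_decode(toy_model, ["devanagari", "romanized", "phonetic"]))
```

Averaging log-probabilities amounts to a geometric mean over the per-signal distributions, a common ensembling choice; the paper's exact combination rule may differ.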
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts to serve as flexible guidance for conditionally encoding instances.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Automatic Discrimination of Human and Neural Machine Translation in Multilingual Scenarios [4.631167282648452]
We tackle the task of automatically discriminating between human and machine translations.
We perform experiments in a multilingual setting, considering multiple languages and multilingual pretrained language models.
arXiv Detail & Related papers (2023-05-31T11:41:24Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
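As a rough illustration of the romanized alternative signal used both here and in the main paper, the sketch below appends romanized copies of the source side to a parallel corpus. It uses the `unidecode` package as a lightweight stand-in for UROMAN, so the actual transliterations these papers used will differ.

```python
# Sketch of adding a romanized "alternative signal" to training data, using
# `unidecode` as a stand-in for the UROMAN transliterator (outputs differ).
from unidecode import unidecode  # pip install unidecode

def add_romanized_copies(pairs):
    """Return the original (source, target) pairs plus copies whose source
    side is romanized, so one model sees both renderings during training."""
    augmented = list(pairs)
    for src, tgt in pairs:
        augmented.append((unidecode(src), tgt))
    return augmented

if __name__ == "__main__":
    data = [
        ("नमस्ते दुनिया", "hello world"),   # Devanagari-script source
        ("Merhaba dünya", "hello world"),  # Latin-script Turkic source
    ]
    for src, tgt in add_romanized_copies(data):
        print(f"{src}\t{tgt}")
```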
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer [4.554080966463776]
Multi-lingual language models (LM) have been remarkably successful in enabling natural language tasks in low-resource languages.
We try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages.
A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer.
arXiv Detail & Related papers (2022-12-04T07:22:21Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages [12.00637655338665]
We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model.
For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
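For intuition about the mining step, here is a minimal sketch that assumes sentence embeddings have already been produced by some encoder: sentences are paired by nearest-neighbour cosine similarity above a threshold. This simplifies the margin-based scoring typically used with LASER-style encoders and is not the paper's pipeline.

```python
# Toy bitext mining over precomputed sentence embeddings (random stand-ins
# here); real pipelines use trained encoders and margin-based scoring.
import numpy as np

def mine_bitext(src_emb, tgt_emb, threshold=0.8):
    """Pair each source sentence with its most similar target sentence by
    cosine similarity, keeping only pairs above the threshold."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                      # (n_src, n_tgt) cosine matrix
    best = sims.argmax(axis=1)
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidates = mine_bitext(rng.normal(size=(5, 16)),
                             rng.normal(size=(8, 16)), threshold=0.0)
    print(candidates)  # (source index, target index, similarity) triples
```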
arXiv Detail & Related papers (2022-05-25T10:53:24Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high- and low-resource languages.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training [45.48003947488825]
We study two widely used robust training methods: adversarial training and randomized smoothing.
The experimental results demonstrate that robust training can improve zero-shot cross-lingual transfer for text classification.
arXiv Detail & Related papers (2021-04-17T21:21:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
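For the SVCCA analysis named above, here is a generic sketch under assumed inputs (rows are languages, columns are representation dimensions): SVD-reduce each view, then take the singular values of the whitened cross-covariance as canonical correlations. This is not the paper's code.

```python
# Generic SVCCA sketch: SVD-reduce two representation matrices, then compute
# canonical correlations between the reduced views (not the paper's code).
import numpy as np

def svcca(X, Y, keep=0.99, eps=1e-10):
    """Return canonical correlation coefficients between views X and Y."""
    def reduce(M):
        M = M - M.mean(axis=0)
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
        return U[:, :k] * s[:k]             # scores on top-k singular vectors
    def whiten(M):
        cov = M.T @ M
        vals, vecs = np.linalg.eigh(cov)
        return M @ (vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T)
    Xr, Yr = reduce(X), reduce(Y)
    # Canonical correlations are the singular values of the whitened
    # cross-covariance between the two reduced views.
    return np.linalg.svd(whiten(Xr).T @ whiten(Yr), compute_uv=False)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 12))                         # e.g. typology-based vectors
    B = A @ rng.normal(size=(12, 8)) + 0.1 * rng.normal(size=(50, 8))
    print(np.round(svcca(A, B), 3))                       # high values reflect shared structure
```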
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.