CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine
Translation for Extremely Low-resource Languages
- URL: http://arxiv.org/abs/2305.05214v2
- Date: Sun, 4 Feb 2024 06:21:03 GMT
- Title: CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine
Translation for Extremely Low-resource Languages
- Authors: Kaushal Kumar Maurya, Rahul Kejriwal, Maunendra Sankar Desarkar, Anoop
Kunchukuttan
- Abstract summary: We address the task of machine translation (MT) from an extremely low-resource language (ELRL) to English by leveraging cross-lingual transfer from a 'closely-related' high-resource language (HRL).
Many ELRLs share lexical similarities with some HRLs, which presents a novel modeling opportunity.
Existing subword-based neural MT models do not explicitly harness this lexical similarity, as they only implicitly align the HRL and ELRL latent embedding spaces.
We propose a novel approach, CharSpan, based on 'character-span noise augmentation' applied to the training data of the HRL. This serves as a regularization technique, making the model more robust to 'lexical divergences' between the HRL and ELRL.
- Score: 22.51558549091902
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We address the task of machine translation (MT) from extremely low-resource
language (ELRL) to English by leveraging cross-lingual transfer from a
'closely-related' high-resource language (HRL). The development of an MT system
for ELRLs is challenging because these languages typically lack parallel corpora
and monolingual corpora, and their representations are absent from large
multilingual language models. Many ELRLs share lexical similarities with some
HRLs, which presents a novel modeling opportunity. However, existing
subword-based neural MT models do not explicitly harness this lexical
similarity, as they only implicitly align the HRL and ELRL latent embedding spaces.
To overcome this limitation, we propose a novel approach, CharSpan, based on
'character-span noise augmentation' applied to the training data of the HRL. This
serves as a regularization technique, making the model more robust to 'lexical
divergences' between the HRL and ELRL, thus facilitating effective
cross-lingual transfer. Our method significantly outperformed strong baselines
in zero-shot settings on closely related HRL and ELRL pairs from three diverse
language families, emerging as the state-of-the-art model for ELRLs.
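As a rough illustration of the core idea, the sketch below applies character-span noise to the HRL source side of the training data. It is a minimal, assumption-laden reading of the abstract: the deletion-only policy, span length, and noise rate are illustrative choices, not the paper's exact recipe.

```python
import random

def charspan_noise(sentence: str, max_span: int = 3, noise_prob: float = 0.1) -> str:
    """Randomly drop short character spans from an HRL source sentence.

    Illustrative only: the deletion-only policy, span length, and noise rate
    are assumptions made for this sketch, not the paper's specification.
    """
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        if chars[i] != " " and random.random() < noise_prob:
            # Skip (delete) a span of 1..max_span characters starting here.
            i += random.randint(1, max_span)
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    random.seed(0)
    # Perturb only the HRL source side of the parallel corpus before training,
    # so the model learns to tolerate small lexical divergences at test time.
    print(charspan_noise("ye pustak bahut acchi hai"))
```

The intent of such noising is regularization: by seeing many slightly corrupted HRL surface forms, the model is less brittle when an ELRL word differs from its HRL cognate by a few characters.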
Related papers
- Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models [12.447489454369636]
This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings.
LLMs can achieve performance comparable to, or even better than, previously proposed models, despite not being explicitly trained for any machine translation task (a similarity-based sketch of this idea appears after this list).
(arXiv 2024-07-23)
- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572]
Cross-lingual transfer is a central task in multilingual NLP.
Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
We propose a simple yet effective method, SALT, to improve zero-shot cross-lingual transfer.
(arXiv 2023-09-19)
- When your Cousin has the Right Connections: Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages [29.346191691508125]
Unsupervised bilingual lexicon induction is most likely to be useful for low-resource languages, where large datasets are not available.
We show that state-of-the-art BLI methods in the literature exhibit near-zero performance for severely data-imbalanced language pairs.
We present a new method for unsupervised BLI between a related LRL and HRL that only requires inference on a masked language model of the HRL.
(arXiv 2023-05-23)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
(arXiv 2023-04-18)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
(arXiv 2022-07-11)
- Non-Linear Pairwise Language Mappings for Low-Resource Multilingual Acoustic Model Fusion [26.728287476234538]
Fusion of hybrid DNN-HMM acoustic models is proposed in a multilingual setup for low-resource languages.
Posterior distributions from different monolingual acoustic models, computed for a target-language speech signal, are fused together.
A separate regression neural network is trained for each source-target language pair to transform posteriors from the source acoustic model to the target language.
(arXiv 2022-07-07)
- Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages [18.862296065737347]
We argue that relatedness among languages in a language family along the dimension of lexical overlap may be leveraged to overcome some of the corpora limitations of LRLs.
We propose Overlap BPE, a simple yet effective modification to the BPE vocabulary generation algorithm which enhances overlap across related languages.
(arXiv 2022-03-03)
- Can Multilinguality benefit Non-autoregressive Machine Translation? [11.671379480940407]
Non-autoregressive (NAR) machine translation has recently achieved significant improvements, and now outperforms autoregressive (AR) models on some benchmarks.
We present a comprehensive empirical study of multilingual NAR.
We test its capabilities with respect to positive transfer between related languages and negative transfer under capacity constraints.
(arXiv 2021-12-16)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that poor cross-lingual transfer arises when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
(arXiv 2021-03-18)
- Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation [104.10726545151043]
Multilingual data has been found to be more beneficial for NMT models that translate from an LRL into a target language than for those that translate into the LRL.
Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
(arXiv 2020-10-04)
- Cross-lingual Semantic Role Labeling with Model Transfer [49.85316125365497]
Cross-lingual semantic role labeling can be achieved by model transfer with the help of universal features.
We propose an end-to-end SRL model that incorporates a variety of universal features and transfer methods.
(arXiv 2020-08-24)
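For the hallucination-detection entry at the top of this list, one plausible reading of "semantic similarity within massively multilingual embeddings" is to embed the source sentence and its translation with a multilingual sentence encoder and flag pairs whose similarity falls below a threshold. Below is a minimal sketch assuming LaBSE via the sentence-transformers library and an arbitrary cutoff; both the encoder choice and the threshold are assumptions, not the paper's reported setup.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed encoder and threshold for illustration; the paper's actual embedding
# models, LLM prompts, and decision rules may differ.
model = SentenceTransformer("sentence-transformers/LaBSE")

def is_hallucination(source: str, translation: str, threshold: float = 0.4) -> bool:
    """Flag a translation whose embedding drifts far from the source sentence's."""
    embeddings = model.encode([source, translation], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity < threshold

if __name__ == "__main__":
    print(is_hallucination("The weather is nice today.",
                           "El clima está agradable hoy."))
```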