A Morphologically-Aware Dictionary-based Data Augmentation Technique for
Machine Translation of Under-Represented Languages
- URL: http://arxiv.org/abs/2402.01939v1
- Date: Fri, 2 Feb 2024 22:25:44 GMT
- Title: A Morphologically-Aware Dictionary-based Data Augmentation Technique for
Machine Translation of Under-Represented Languages
- Authors: Md Mahfuz Ibn Alam, Sina Ahmadi and Antonios Anastasopoulos
- Abstract summary: We propose strategies to synthesize parallel data relying on morpho-syntactic information and using bilingual lexicons.
Our methodology adheres to a realistic scenario backed by a small amount of parallel seed data.
It is linguistically informed, as it aims to create augmented data that is more likely to be grammatically correct.
- Score: 31.18983138590214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The availability of parallel texts is crucial to the performance of machine
translation models. However, most of the world's languages face the predominant
challenge of data scarcity. In this paper, we propose strategies to synthesize
parallel data relying on morpho-syntactic information and using bilingual
lexicons along with a small amount of seed parallel data. Our methodology
adheres to a realistic scenario backed by a small amount of parallel seed data. It is
linguistically informed, as it aims to create augmented data that is more
likely to be grammatically correct. We analyze how our synthetic data can be
combined with raw parallel data and demonstrate a consistent improvement in
performance in our experiments on 14 languages (28 English <-> X pairs) ranging
from well- to very low-resource ones. Our method leads to improvements even
when using only five seed sentences and a bilingual lexicon.
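To make the core idea concrete, here is a minimal Python sketch of the dictionary-based substitution step: an aligned word pair in a seed sentence is swapped for a bilingual-lexicon entry, and the new target word is reinflected to match the morphology of the word it replaces. The toy lexicon, the alignment, and the `lemmatise`/`reinflect` helpers are hypothetical stand-ins; the paper's morphological handling is richer.

```python
bilingual_lexicon = {"dog": ["hund"], "wolf": ["ulv"]}   # toy EN -> X lexicon
aligned = {"dog": "hunden"}   # toy word alignment for the seed pair below

def lemmatise(word):
    return word.rstrip("ens")                 # hypothetical toy lemmatiser

def reinflect(lemma, template):
    """Copy the inflection (suffix) of the word being replaced."""
    return lemma + template[len(lemmatise(template)):]

def augment(src, tgt, alignment):
    """Swap an aligned word pair for a lexicon entry, reinflecting the
    new target word so the synthetic sentence stays grammatical."""
    out = []
    for en_old, x_old in alignment.items():
        for en_new, x_lemmas in bilingual_lexicon.items():
            if en_new == en_old:
                continue
            x_new = reinflect(x_lemmas[0], x_old)   # morphology-aware swap
            out.append((src.replace(en_old, en_new),
                        tgt.replace(x_old, x_new)))
    return out

# One seed pair yields the synthetic pair ("the wolf sleeps", "ulven sover")
print(augment("the dog sleeps", "hunden sover", aligned))
```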
Related papers
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data improves performance on monolingual STS.
We find a superiority of the Wikipedia domain over the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
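As a rough illustration of the alignment objective, here is a minimal numpy sketch of an entropy-regularised (Sinkhorn) optimal-transport distance between two batches of latent vectors. The squared-L2 cost, epsilon, and iteration count are assumptions, not the paper's exact formulation.

```python
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iters=100):
    """Entropy-regularised OT cost between point clouds x (n,d) and y (m,d)."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise sq. L2
    K = np.exp(-cost / eps)                                  # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))                        # uniform marginals
    b = np.full(len(y), 1.0 / len(y))
    u = np.ones_like(a)
    for _ in range(n_iters):                                 # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]                       # transport plan
    return (plan * cost).sum()

# e.g. latent encodings of English vs. low-resource-language utterances
en_latents = np.random.randn(32, 16)
xx_latents = np.random.randn(32, 16)
print(sinkhorn_distance(en_latents, xx_latents))
```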
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Bilex Rx: Lexical Data Augmentation for Massively Multilingual Machine Translation [33.6064740446337]
This work explores a cheap and abundant resource to combat data scarcity: bilingual lexica.
We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text.
We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements; and (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones.
arXiv Detail & Related papers (2023-03-27T14:54:43Z)
- On the Role of Parallel Data in Cross-lingual Transfer Learning [30.737717433111776]
We examine the usage of unsupervised machine translation to generate synthetic parallel data.
We find that even model generated parallel data can be useful for downstream tasks.
Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data.
arXiv Detail & Related papers (2022-12-20T11:23:04Z)
- Language Agnostic Multilingual Information Retrieval with Contrastive Learning [59.26316111760971]
We present an effective method to train multilingual information retrieval systems.
We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models.
Our model can work well even with a small number of parallel sentences.
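A minimal sketch of the contrastive signal that parallel data provides, using an InfoNCE loss with in-batch negatives over toy sentence embeddings; the actual systems encode sentences with a pretrained multilingual language model.

```python
import torch
import torch.nn.functional as F

def info_nce(src_vecs, tgt_vecs, temperature=0.05):
    """Aligned rows of src_vecs/tgt_vecs are translations (positives);
    every other row in the batch serves as a negative."""
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    logits = src @ tgt.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(src.size(0))          # positives on the diagonal
    return F.cross_entropy(logits, labels)

# toy stand-ins for encoder outputs of a parallel sentence batch
src_vecs = torch.randn(16, 256, requires_grad=True)
tgt_vecs = torch.randn(16, 256, requires_grad=True)
loss = info_nce(src_vecs, tgt_vecs)
loss.backward()
```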
arXiv Detail & Related papers (2022-10-12T23:53:50Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data {translated source, natural target}, but it translates natural source sentences at inference.
The source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo-parallel data {natural source, translated target} to mimic the inference scenario.
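The contrast between the two kinds of training pairs can be sketched in a few lines of Python; `translate_to_source` and `translate_to_target` are hypothetical stand-ins for the two directions of a UNMT model.

```python
def translate_to_source(tgt_sentence):      # hypothetical backward model
    return f"src({tgt_sentence})"

def translate_to_target(src_sentence):      # hypothetical forward model
    return f"tgt({src_sentence})"

tgt_mono = ["ein satz", "noch ein satz"]
src_mono = ["a sentence", "another sentence"]

# Back-translation: {translated source, natural target} -- the training
# source is synthetic, yet inference sees natural source text.
bt_pairs = [(translate_to_source(y), y) for y in tgt_mono]

# Online self-training: {natural source, translated target} -- the
# training source now matches what the model sees at inference time.
st_pairs = [(x, translate_to_target(x)) for x in src_mono]

print(bt_pairs, st_pairs, sep="\n")
```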
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- Cross-language Sentence Selection via Data Augmentation and Rationale Training [22.106577427237635]
The proposed approach uses data augmentation and negative sampling techniques on noisy parallel sentence data to learn a cross-lingual embedding-based query relevance model.
Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data.
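A minimal sketch of one way such training examples could be built from parallel sentences: a sampled target-side phrase acts as the "query", its own source sentence is the positive, and other source sentences are negatives. The phrase-as-query construction is an illustrative assumption, not the paper's exact recipe.

```python
import random

random.seed(0)
parallel = [("waa maxay", "what is it"), ("waan tagayaa", "i am leaving")]

def make_examples(pairs, n_neg=1):
    examples = []
    for i, (src, tgt) in enumerate(pairs):
        words = tgt.split()
        start = random.randrange(len(words))
        query = " ".join(words[start:start + 2])      # English "query" phrase
        examples.append((query, src, 1))              # positive: its own source
        for _ in range(n_neg):
            j = random.choice([k for k in range(len(pairs)) if k != i])
            examples.append((query, pairs[j][0], 0))  # negative: other source
    return examples

print(make_examples(parallel))
```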
arXiv Detail & Related papers (2021-06-04T07:08:47Z)
- Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation [98.83925811122795]
We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach.
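A minimal sketch of dictionary-based uncertainty: each source word covered by a probabilistic bilingual dictionary contributes the entropy of its translation distribution, and sentences are ranked by mean entropy. The toy dictionary and the mean-entropy aggregation are assumptions, not the paper's exact scoring.

```python
import math

# P(target word | source word), e.g. extracted from word-aligned parallel data
prob_dict = {
    "bank": {"Bank": 0.5, "Ufer": 0.5},      # ambiguous -> high entropy
    "river": {"Fluss": 0.9, "Strom": 0.1},
    "the": {"der": 1.0},                     # unambiguous -> zero entropy
}

def sentence_uncertainty(sentence):
    """Mean translation entropy over the words covered by the dictionary."""
    entropies = []
    for w in sentence.lower().split():
        if w in prob_dict:
            probs = prob_dict[w].values()
            entropies.append(-sum(p * math.log(p) for p in probs))
    return sum(entropies) / len(entropies) if entropies else 0.0

mono = ["the bank of the river", "the river"]
# Select the most informative (highest-uncertainty) sentences first.
print(sorted(mono, key=sentence_uncertainty, reverse=True))
```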
arXiv Detail & Related papers (2021-06-02T05:01:36Z)
- Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
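The meta-learning structure can be sketched with toy linear "translators" over continuous vectors: the forward model takes one differentiable inner step on a pseudo pair, and the dev loss of the adapted model is backpropagated into the back-translation model. Dimensions, step sizes, and the one-step inner loop are illustrative assumptions.

```python
import torch

dim, inner_lr = 8, 0.1
torch.manual_seed(0)

W_bt = torch.randn(dim, dim, requires_grad=True)    # back-translation "model"
W_fwd = torch.randn(dim, dim, requires_grad=True)   # forward "model"

y_mono = torch.randn(16, dim)                       # target-side monolingual data
x_dev, y_dev = torch.randn(16, dim), torch.randn(16, dim)  # parallel dev set

meta_opt = torch.optim.SGD([W_bt], lr=0.01)
for step in range(50):
    x_pseudo = y_mono @ W_bt.T                      # pseudo-source from target mono
    loss_inner = ((x_pseudo @ W_fwd.T - y_mono) ** 2).mean()
    (g_fwd,) = torch.autograd.grad(loss_inner, W_fwd, create_graph=True)
    W_fwd_adapted = W_fwd - inner_lr * g_fwd        # differentiable inner step
    # Meta-objective: the adapted forward model's loss on real dev data.
    loss_dev = ((x_dev @ W_fwd_adapted.T - y_dev) ** 2).mean()
    meta_opt.zero_grad()
    loss_dev.backward()                             # meta-gradient flows into W_bt
    meta_opt.step()
    with torch.no_grad():                           # apply the inner step for real
        W_fwd -= inner_lr * g_fwd
```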
arXiv Detail & Related papers (2021-02-15T20:58:32Z)
- Synthesizing Monolingual Data for Neural Machine Translation [22.031658738184166]
In neural machine translation (NMT), monolingual data in the target language are usually exploited to synthesize additional training parallel data.
Large monolingual data in the target domains or languages are not always available to generate large synthetic parallel data.
We propose a new method to generate large synthetic parallel data leveraging very small monolingual data in a specific domain.
arXiv Detail & Related papers (2021-01-29T08:17:40Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
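A minimal PyTorch sketch of the joint objective: one LSTM encoder feeds two decoders, one predicting the translation and one reconstructing the input, and the encoder states serve as contextualised embeddings. Vocabulary sizes, dimensions, and the toy aligned batch are assumptions.

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb_dim, hid = 100, 120, 32, 64

class TranslateReconstruct(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid, batch_first=True)
        self.dec_tgt = nn.LSTM(hid, hid, batch_first=True)   # translation decoder
        self.dec_src = nn.LSTM(hid, hid, batch_first=True)   # reconstruction decoder
        self.out_tgt = nn.Linear(hid, tgt_vocab)
        self.out_src = nn.Linear(hid, src_vocab)

    def forward(self, src):
        enc, _ = self.encoder(self.src_emb(src))  # contextualised embeddings
        trans, _ = self.dec_tgt(enc)              # translate the input sentence
        recon, _ = self.dec_src(enc)              # reconstruct the input sentence
        return self.out_tgt(trans), self.out_src(recon)

model = TranslateReconstruct()
src = torch.randint(0, src_vocab, (8, 10))        # toy source batch
tgt = torch.randint(0, tgt_vocab, (8, 10))        # toy aligned target batch
logits_tgt, logits_src = model(src)
loss = (nn.functional.cross_entropy(logits_tgt.reshape(-1, tgt_vocab), tgt.reshape(-1))
        + nn.functional.cross_entropy(logits_src.reshape(-1, src_vocab), src.reshape(-1)))
loss.backward()   # encoder states double as contextualised cross-lingual embeddings
```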
arXiv Detail & Related papers (2020-10-27T22:24:01Z)