AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT
- URL: http://arxiv.org/abs/2106.05141v1
- Date: Wed, 9 Jun 2021 15:29:18 GMT
- Title: AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT
- Authors: Tasnim Mohiuddin, M Saiful Bari, Shafiq Joty
- Abstract summary: AUGVIC is a novel data augmentation framework for low-resource NMT.
It exploits the vicinal samples of the given bitext without using any extra monolingual data explicitly.
We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation.
- Score: 9.797319790710711
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of Neural Machine Translation (NMT) largely depends on the
availability of large bitext training corpora. Due to the lack of such large
corpora in low-resource language pairs, NMT systems often exhibit poor
performance. Extra relevant monolingual data often helps, but acquiring it
could be quite expensive, especially for low-resource languages. Moreover,
domain mismatch between bitext (train/test) and monolingual data might degrade
the performance. To alleviate such issues, we propose AUGVIC, a novel data
augmentation framework for low-resource NMT which exploits the vicinal samples
of the given bitext without using any extra monolingual data explicitly. It can
diversify the in-domain bitext data with finer level control. Through extensive
experiments on four low-resource language pairs comprising data from different
domains, we have shown that our method is comparable to the traditional
back-translation that uses extra in-domain monolingual data. When we combine
the synthetic parallel data generated from AUGVIC with the ones from the extra
monolingual data, we achieve further improvements. We show that AUGVIC helps to
attenuate the discrepancies between relevant and distant-domain monolingual
data in traditional back-translation. To understand the contributions of
different components of AUGVIC, we perform an in-depth framework analysis.
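As a rough illustration of the vicinal-augmentation idea (not the authors' exact recipe), the sketch below generates "vicinal" variants of target-side sentences with a masked language model and pairs each variant with a synthetic source via back-translation; the model names, helper functions, and single-token masking strategy are all illustrative assumptions.

```python
import random
from transformers import pipeline

# Illustrative components (assumptions, not the paper's exact models):
#  - a masked LM proposes vicinal variants of target-side sentences
#  - a target->source translation model pairs each variant with a synthetic source
mlm = pipeline("fill-mask", model="xlm-roberta-base")
back_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def vicinal_variants(sentence: str, top_k: int = 3):
    """Propose near-neighbour ("vicinal") variants of a target sentence by
    masking one random token and letting the masked LM refill it."""
    tokens = sentence.split()
    if len(tokens) < 3:
        return []
    i = random.randrange(len(tokens))
    masked = " ".join(tokens[:i] + [mlm.tokenizer.mask_token] + tokens[i + 1:])
    return [pred["sequence"] for pred in mlm(masked, top_k=top_k)]

def augment_bitext(bitext):
    """For each (source, target) pair, emit synthetic pairs built from vicinal
    target variants and their back-translations; no external monolingual data."""
    synthetic = []
    for _, tgt in bitext:
        for variant in vicinal_variants(tgt):
            syn_src = back_translator(variant)[0]["translation_text"]
            synthetic.append((syn_src, variant))
    return synthetic

if __name__ == "__main__":
    bitext = [("the cat sits on the mat", "die Katze sitzt auf der Matte")]
    print(augment_bitext(bitext))
```

Because the vicinal targets are derived from the bitext itself, the synthetic pairs stay close to the in-domain training distribution, which is the property the abstract emphasises over back-translation with external monolingual data.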
Related papers
- Low-Resource Machine Translation through the Lens of Personalized Federated Learning [26.436144338377755]
We present MeritFed, a new approach that can be applied to natural language tasks with heterogeneous data.
We evaluate it on the Low-Resource Machine Translation task, using the dataset from the Large-Scale Multilingual Machine Translation Shared Task.
In addition to its effectiveness, MeritFed is also highly interpretable, as it can track the impact of each language used for training.
arXiv Detail & Related papers (2024-06-18T12:50:00Z) - Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity [2.422759879602353]
Cross-lingual transfer of Wikipedia data improves performance on monolingual STS.
We find that the Wikipedia domain outperforms the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Ngambay-French Neural Machine Translation (sba-Fr) [16.55378462843573]
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data.
arXiv Detail & Related papers (2023-08-25T17:13:20Z) - When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale [73.69252847606212]
We examine how denoising autoencoding (DAE) and backtranslation (BT) impact multilingual machine translation (MMT).
We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales.
As scale increases, DAE transitions from underperforming the parallel-only baseline at 90M parameters to converging with BT performance at 1.6B, and even surpassing it in low-resource settings.
arXiv Detail & Related papers (2023-05-23T14:48:42Z) - Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile model, Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z) - Exploiting Language Relatedness in Machine Translation Through Domain Adaptation Techniques [3.257358540764261]
We present a novel approach that scores sentences with a scaled similarity score based on a 5-gram KenLM language model, aimed especially at related languages (see the sketch after this list).
Our approach yields gains of 2 BLEU points with the multi-domain approach, 3 BLEU points with fine-tuning for NMT, and 2 BLEU points with the iterative back-translation approach.
arXiv Detail & Related papers (2023-03-03T09:07:30Z) - Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that covers adequate variants of literal expression with the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z) - Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z) - Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
arXiv Detail & Related papers (2020-05-11T00:20:33Z)
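A rough sketch of the KenLM-based scoring idea from the "Exploiting Language Relatedness" entry above; the model path, length normalisation, and selection threshold are assumptions for illustration, not the paper's exact formula.

```python
import kenlm  # Python bindings for KenLM (pip package "kenlm")

# Hypothetical 5-gram LM trained on text from the related / in-domain language.
lm = kenlm.Model("related_language.5gram.arpa")

def scaled_similarity(sentence: str) -> float:
    """Length-normalised KenLM log10-probability mapped back to a 0-1 scale.
    The scaling here is illustrative, not the paper's exact definition."""
    words = sentence.split()
    if not words:
        return 0.0
    logprob = lm.score(sentence, bos=True, eos=True)  # total log10 probability
    return 10 ** (logprob / len(words))               # per-word probability

# Keep sentences the LM considers sufficiently similar to the related language.
corpus = ["a sentence in a closely related language", "zzz unrelated noise zzz"]
selected = [s for s in corpus if scaled_similarity(s) > 0.01]
print(selected)
```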
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.