Related papers: Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources

Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources

URL: http://arxiv.org/abs/2103.13272v1
Date: Wed, 24 Mar 2021 15:40:28 GMT
Title: Low-Resource Machine Translation for Low-Resource Languages: Leveraging Comparable Data, Code-Switching and Compute Resources
Authors: Garry Kuwanto, Afra Feyza Aky\"urek, Isidora Chara Tourni, Siyang Li, Derry Wijaya
Abstract summary: We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low resource languages. We show how adding comparable data mined using a bilingual dictionary along with modest additional compute resource to train the model can significantly improve its performance. Our work is the first to quantitatively showcase the impact of different modest compute resource in low resource NMT.
Score: 4.119597443825115
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low resource languages, exploring the case when both parallel training data and compute resource are lacking, reflecting the reality of most of the world's languages and the researchers working on these languages. We propose a simple and scalable method to improve unsupervised NMT, showing how adding comparable data mined using a bilingual dictionary along with modest additional compute resource to train the model can significantly improve its performance. We also demonstrate how the use of the dictionary to code-switch monolingual data to create more comparable data can further improve performance. With this weak supervision, our best method achieves BLEU scores that improve over supervised results for English$\rightarrow$Gujarati (+18.88), English$\rightarrow$Kazakh (+5.84), and English$\rightarrow$Somali (+1.16), showing the promise of weakly-supervised NMT for many low resource languages with modest compute resource in the world. To the best of our knowledge, our work is the first to quantitatively showcase the impact of different modest compute resource in low resource NMT.

Related papers

Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT. This study systematically investigates how each resource and its quality affects the translation performance, with the Manchu language. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
arXiv Detail & Related papers (2025-02-17T14:53:49Z)
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
Efficient Continual Pre-training of LLMs for Low-resource Languages [45.44796295841526]
We develop a new algorithm to select a subset of texts from a larger corpus. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary.
arXiv Detail & Related papers (2024-12-13T16:13:35Z)
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages [2.66269503676104]
Large language models (LLMs) under-perform on low-resource languages. We present a method to efficiently collect text data for low-resource languages. Our approach, UnifiedCrawl, filters and extracts common crawl using minimal compute resources.
arXiv Detail & Related papers (2024-11-21T17:41:08Z)
Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT) We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training. Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora. Their performance still lags behind in most languages compared to a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
MetaXLR -- Mixed Language Meta Representation Transformation for Low-resource Cross-lingual Learning based on Multi-Armed Bandit [0.0]
We propose an enhanced approach which uses multiple source languages chosen in a data driven manner. We achieve state of the art results on the NER task for the extremely low resource languages while using the same amount of data.
arXiv Detail & Related papers (2023-05-31T18:22:33Z)
Semi-supervised Neural Machine Translation with Consistency Regularization for Low-Resource Languages [3.475371300689165]
This paper presents a simple yet effective method to tackle the problem for low-resource languages by augmenting high-quality sentence pairs and training NMT models in a semi-supervised manner. Specifically, our approach combines the cross-entropy loss for supervised learning with KL Divergence for unsupervised fashion given pseudo and augmented target sentences. Experimental results show that our approach significantly improves NMT baselines, especially on low-resource datasets with 0.46--2.03 BLEU scores.
arXiv Detail & Related papers (2023-04-02T15:24:08Z)
Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages [0.6833698896122186]
Access to high computational resources added to the issue of data scarcity of African languages. We evaluate language adapters as cost-effective approaches to low-resource African NLP. This opens the door to further experimentation and exploration on full-extent of language adapters capacities.
arXiv Detail & Related papers (2023-03-29T19:25:43Z)
Learning Translation Quality Evaluation on Low Resource Languages from Large Language Models [4.168157981135698]
We show how knowledge can be distilled from Large Language Models (LLMs) to improve upon learned metrics without requiring human annotators. We show that the performance of a BLEURT-like model on lower resource languages can be improved in this way.
arXiv Detail & Related papers (2023-02-07T14:35:35Z)
Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization. We show how to practically optimize this objective for large translation corpora using an iterated best response scheme. Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model. We carry out experiments on 42 translation directions across a diverse setting, including low, medium, rich resource, and as well as transferring to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. Self-supervision improves zero-shot translation quality in multilingual models. We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
arXiv Detail & Related papers (2020-05-11T00:20:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.