Letz Translate: Low-Resource Machine Translation for Luxembourgish
- URL: http://arxiv.org/abs/2303.01347v1
- Date: Thu, 2 Mar 2023 15:26:46 GMT
- Title: Letz Translate: Low-Resource Machine Translation for Luxembourgish
- Authors: Yewei Song, Saad Ezzini, Jacques Klein, Tegawendé Bissyandé, Clément Lefebvre and Anne Goujon
- Abstract summary: We build resource-efficient models based on German, knowledge distillation from the multilingual No Language Left Behind model, and pseudo-translation.
We find that our efficient models are more than 30% faster and perform only 4% lower compared to the large state-of-the-art NLLB model.
- Score: 4.860100893494234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing of Low-Resource Languages (LRL) is often
challenged by the lack of data. Therefore, achieving accurate machine
translation (MT) in a low-resource environment is a real problem that requires
practical solutions. Research in multilingual models has shown that some LRLs
can be handled with such models. However, their large size and computational
needs make their use in constrained environments (e.g., mobile/IoT devices or
limited/old servers) impractical. In this paper, we address this problem by
leveraging the power of large multilingual MT models using knowledge
distillation. Knowledge distillation can transfer knowledge from a large and
complex teacher model to a simpler and smaller student model without losing
much in performance. We also make use of high-resource languages that are
related or share the same linguistic root as the target LRL. For our
evaluation, we consider Luxembourgish as the LRL that shares some roots and
properties with German. We build multiple resource-efficient models based on
German, knowledge distillation from the multilingual No Language Left Behind
(NLLB) model, and pseudo-translation. We find that our efficient models are
more than 30% faster and perform only 4% lower than the large
state-of-the-art NLLB model.
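To make the approach concrete, the sketch below illustrates sequence-level knowledge distillation of the kind described in the abstract: the large NLLB teacher labels monolingual Luxembourgish text, and the resulting synthetic parallel pairs can then be used to train a small student translation model. It assumes the Hugging Face transformers library and the public facebook/nllb-200-distilled-600M checkpoint; the checkpoint name, language codes, and example sentence are illustrative assumptions, not the authors' exact configuration.
```python
# Sketch of sequence-level knowledge distillation / pseudo-translation with an
# NLLB teacher, assuming the Hugging Face `transformers` library. The checkpoint
# name, language codes, and toy sentence are illustrative, not the paper's setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

TEACHER = "facebook/nllb-200-distilled-600M"  # large multilingual teacher
tokenizer = AutoTokenizer.from_pretrained(TEACHER, src_lang="ltz_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(TEACHER)

def pseudo_translate(sentences, tgt_lang="eng_Latn", max_length=128):
    """Let the teacher label monolingual Luxembourgish text with translations."""
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    outputs = teacher.generate(
        **batch,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_length=max_length,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# The (source, teacher output) pairs form a synthetic parallel corpus on which a
# much smaller student model is then trained with the standard cross-entropy
# objective, yielding a faster model suitable for constrained hardware.
print(pseudo_translate(["Moien, wéi geet et dir?"]))
```
In the same spirit, the paper exploits the closeness of German: because Luxembourgish and German share linguistic roots, German-based models and German-side data can bootstrap the student before or alongside distillation.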
Related papers
- Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation [62.202893186343935]
We explore what it would take to adapt Large Language Models for low-resource languages.
We show that parallel data is critical during both pre-training and Supervised Fine-Tuning (SFT).
Our experiments with three LLMs across two low-resourced language groups reveal consistent trends, underscoring the generalizability of our findings.
arXiv Detail & Related papers (2024-08-23T00:59:38Z)
- Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models [12.447489454369636]
This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings.
LLMs can achieve performance comparable or even better than previously proposed models, despite not being explicitly trained for any machine translation task.
arXiv Detail & Related papers (2024-07-23T13:40:54Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
- Empirical Studies of Parameter Efficient Methods for Large Language Models of Code and Knowledge Transfer to R [1.9799527196428242]
Large Language Models (LLMs) have gained significant attention in the Software Engineering (SE) community.
In this work, we empirically study PEFT methods, LoRA and Compacter, on CodeT5 and CodeLlama.
We will assess their performance compared to fully fine-tuned models, whether they can be used for knowledge transfer from natural language models to code, and their ability to adapt the learned knowledge to an unseen language.
arXiv Detail & Related papers (2024-03-16T03:12:45Z)
- Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks.
LLMs often struggle to perform well on low-resource languages because there is so little training data available.
In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance in most languages still lags behind that of a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Deep Learning Models for Multilingual Hate Speech Detection [5.977278650516324]
In this paper, we conduct a large-scale analysis of multilingual hate speech in 9 languages from 16 different sources.
We observe that in the low-resource setting, simple models such as LASER embeddings with logistic regression perform the best (see the sketch after this list).
In the case of zero-shot classification, languages such as Italian and Portuguese achieve good results.
arXiv Detail & Related papers (2020-04-14T13:14:27Z)
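For the multilingual hate speech entry above, the following is a minimal, hypothetical sketch of the "LASER embeddings + logistic regression" baseline it mentions, assuming the laserembeddings and scikit-learn packages; the texts, labels, and language codes are placeholders rather than the paper's data.
```python
# Hypothetical sketch of a LASER-embeddings + logistic-regression classifier,
# assuming the `laserembeddings` and scikit-learn packages are installed and the
# LASER model files have been downloaded; data and languages are placeholders.
from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression

laser = Laser()

# Placeholder training data: sentences with binary hate/non-hate labels.
train_texts = ["example sentence one", "example sentence two"]
train_labels = [0, 1]

# Language-agnostic sentence embeddings let one classifier serve many languages.
X_train = laser.embed_sentences(train_texts, lang="en")
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Zero-shot-style reuse: embed text in another language with the same encoder.
X_test = laser.embed_sentences(["frase di esempio"], lang="it")
print(clf.predict(X_test))
```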