Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy
- URL: http://arxiv.org/abs/2412.09415v2
- Date: Fri, 20 Dec 2024 09:43:33 GMT
- Title: Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy
- Authors: Alistair Plum, Tharindu Ranasinghe, Christoph Purschke
- Abstract summary: This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish.
We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts of German and French data.
For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.
- Abstract: This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish. Despite its active development, Luxembourgish faces a digital data scarcity, exacerbated by Luxembourg's multilingual context. We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts, in terms of size and type, of German and French data. We hypothesise that a model trained on Luxembourgish, German, and French will improve the model's cross-lingual transfer learning capabilities and outperform monolingual and large multilingual models. To verify this, the study at hand explores whether multilingual or monolingual training is more beneficial for Luxembourgish language generation. For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.
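As a rough sketch of the balanced data strategy described above, the snippet below caps the German and French corpora at the size of the Luxembourgish corpus before mixing them into a single pretraining set. The file names and the use of the Hugging Face datasets library are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of the balanced data strategy: cap the German and French
# corpora to the size of the (smaller) Luxembourgish corpus so all three
# languages contribute equally to pretraining. File names are placeholders.
from datasets import load_dataset, concatenate_datasets

corpora = {
    "lb": load_dataset("text", data_files="luxembourgish.txt", split="train"),
    "de": load_dataset("text", data_files="german.txt", split="train"),
    "fr": load_dataset("text", data_files="french.txt", split="train"),
}

# Luxembourgish is the limiting resource; sample the other languages down to it.
target_size = len(corpora["lb"])
balanced = [
    ds.shuffle(seed=42).select(range(min(target_size, len(ds))))
    for ds in corpora.values()
]

# Interleave the three equally sized corpora into one pretraining dataset.
pretraining_data = concatenate_datasets(balanced).shuffle(seed=42)
print(pretraining_data)
```

Capping rather than oversampling keeps the three languages at a strict 1:1:1 ratio, in line with the paper's "equal amounts, in terms of size and type" formulation.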
Related papers
- Adapting Multilingual Embedding Models to Historical Luxembourgish [5.474797258314828]
Pre-trained multilingual models, typically evaluated on contemporary texts, face challenges with historical digitized content due to OCR noise and outdated spellings.
We explore the use of multilingual embeddings for cross-lingual semantic search on historical Luxembourgish.
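To make the idea of cross-lingual semantic search concrete, here is a minimal sketch using an off-the-shelf multilingual sentence-embedding model; the model name and example sentences are placeholders and do not reflect the paper's actual setup.

```python
# A hedged sketch of cross-lingual semantic search with a multilingual
# embedding model. Model and example texts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Query in contemporary Luxembourgish, candidates from a (noisy) historical corpus.
query = "D'Stad Lëtzebuerg ass d'Haaptstad vum Land."
candidates = [
    "Letzeburg ist die Hauptstadt des Großherzogtums.",
    "Le temps sera pluvieux demain dans le nord du pays.",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Rank candidates by cosine similarity to the query.
scores = util.cos_sim(query_emb, cand_embs)[0]
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text}")
```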
arXiv Detail & Related papers (2025-02-11T20:35:29Z)
- LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings [8.839362558895594]
Sentence embedding models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish.
This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages.
We present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages.
arXiv Detail & Related papers (2024-12-04T14:02:12Z)
- LuxBank: The First Universal Dependency Treebank for Luxembourgish [0.38447712214412116]
Luxembourgish is a West Germanic language spoken by approximately 400,000 people.
We introduce LuxBank, the first Universal Dependencies (UD) Treebank for Luxembourgish.
arXiv Detail & Related papers (2024-11-07T15:50:40Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed RAVEn (Raw Audio-Visual Speech Encoders) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
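As a toy illustration of execution-based evaluation in the spirit of these benchmarks, the sketch below runs a candidate completion against test cases and reports pass/fail; the problem, solution, and tests are invented placeholders, not items from MBXP, Multilingual HumanEval, or MathQA-X.

```python
# Toy execution-based check in the spirit of MBXP-style evaluation.
# The candidate completion and its tests are invented placeholders.
candidate_code = """
def add(a, b):
    return a + b
"""

test_code = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""

def passes_tests(solution: str, tests: str) -> bool:
    """Execute a generated solution, then run its test cases against it."""
    namespace: dict = {}
    try:
        exec(solution, namespace)   # define the candidate function
        exec(tests, namespace)      # run the assertions against it
        return True
    except Exception:
        return False

print("pass@1:", passes_tests(candidate_code, test_code))
```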
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Are Pretrained Multilingual Models Equally Fair Across Languages? [0.0]
This work investigates the group fairness of multilingual models, asking whether these models are equally fair across languages.
We evaluate three multilingual models on MozArt -- mBERT, XLM-R, and mT5 -- and show that across the four target languages, the three models exhibit different levels of group disparity.
arXiv Detail & Related papers (2022-10-11T13:59:19Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that cross-lingual pretraining is less effective when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)