UCCIX: Irish-eXcellence Large Language Model
- URL: http://arxiv.org/abs/2405.13010v1
- Date: Mon, 13 May 2024 13:19:27 GMT
- Title: UCCIX: Irish-eXcellence Large Language Model
- Authors: Khanh-Tung Tran, Barry O'Sullivan, Hoang D. Nguyen
- Abstract summary: This work presents UCCIX, a pioneering effort in the development of an open-source Irish-based LLM.
We propose a novel framework for continued pre-training of LLMs specifically adapted for extremely low-resource languages.
Our model, based on Llama 2-13B, outperforms much larger models on Irish language tasks with up to 12% performance improvement.
- Score: 3.9530780161144667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of Large Language Models (LLMs) has predominantly focused on high-resource languages, leaving extremely low-resource languages like Irish with limited representation. This work presents UCCIX, a pioneering effort in the development of an open-source Irish-based LLM. We propose a novel framework for continued pre-training of LLMs specifically adapted for extremely low-resource languages, requiring only a fraction of the textual data typically needed for training LLMs according to scaling laws. Our model, based on Llama 2-13B, outperforms much larger models on Irish language tasks with up to 12% performance improvement, showcasing the effectiveness and efficiency of our approach. We also contribute comprehensive Irish benchmarking datasets, including IrishQA, a question-answering dataset, and an Irish version of MT-bench. These datasets enable rigorous evaluation and facilitate future research in Irish LLM systems. Our work aims to preserve and promote the Irish language, knowledge, and culture of Ireland in the digital era while providing a framework for adapting LLMs to other indigenous languages.
Related papers
- IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation [3.9530780161144667]
We present IRLBench, a benchmark presented in parallel English and Irish.
Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams.
We show that models produce valid Irish responses less than 80% of the time, and answer correctly 55.8% of the time in Irish, compared to 76.2% in English, for the best-performing model.
arXiv Detail & Related papers (2025-05-16T00:02:05Z) - Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging -- An Open Recipe [12.076338505539194]
This paper aims to enhance the reasoning capabilities of language-specific large language models (LLMs).
DeepSeek R1 excels in reasoning but primarily benefits high-resource languages such as English and Chinese.
Low-resource languages remain underserved due to the dominance of English-centric training data and model optimizations.
arXiv Detail & Related papers (2025-02-13T08:10:45Z) - Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - Multilingual Prompts in LLM-Based Recommenders: Performance Across Languages [0.0]
This work explores the impact of non-English prompts on recommendation performance.
Evaluation on three real-world datasets, namely ML1M, LastFM, and Amazon-Beauty, showed that the use of non-English prompts generally reduces performance.
Retraining with multilingual prompts resulted in more balanced performance across languages, but slightly reduced English performance.
arXiv Detail & Related papers (2024-09-11T20:31:42Z) - A Survey of Large Language Models for European Languages [4.328283741894074]
Large Language Models (LLMs) have gained significant attention due to their high performance on a wide range of natural language tasks.
We present an overview of LLM families, including LLaMA, PaLM, GPT, and MoE.
We provide a comprehensive summary of common monolingual and multilingual datasets used for pretraining large language models.
arXiv Detail & Related papers (2024-08-27T13:10:05Z) - High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models [5.632410663467911]
We explore the extent to which pretrained large language models (LLMs) can bridge the performance gap for under-resourced languages.
We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins.
For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English.
arXiv Detail & Related papers (2024-02-19T16:29:40Z) - Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance still lags behind in most languages compared to a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z) - Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z) - Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Augmented Large Language Models with Parametric Knowledge Guiding [72.71468058502228]
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) with their impressive language understanding and generation capabilities.
Their performance may be suboptimal for domain-specific tasks that require specialized knowledge due to limited exposure to the related data.
We propose the novel Parametric Knowledge Guiding (PKG) framework, which equips LLMs with a knowledge-guiding module to access relevant knowledge.
arXiv Detail & Related papers (2023-05-08T15:05:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.