Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
- URL: http://arxiv.org/abs/2409.17912v2
- Date: Mon, 11 Nov 2024 22:14:04 GMT
- Title: Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
- Authors: Guokan Shang, Hadi Abdine, Yousef Khoubrane, Amr Mohamed, Yassine Abbahaddou, Sofiane Ennadir, Imane Momayiz, Xuguang Ren, Eric Moulines, Preslav Nakov, Michalis Vazirgiannis, Eric Xing,
- Abstract summary: We construct our instruction dataset by consolidating existing Darija language resources.
Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions.
- Score: 45.755756115243486
- License:
- Abstract: We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU, in our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.
Related papers
- GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning [0.0]
We introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content.
We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality.
Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks.
arXiv Detail & Related papers (2024-07-02T10:43:49Z) - Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z) - Data-Augmentation-Based Dialectal Adaptation for LLMs [26.72394783468532]
This report presents GMUNLP's participation to the Dialect-Copa shared task at VarDial 2024.
The task focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects.
We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance.
arXiv Detail & Related papers (2024-04-11T19:15:32Z) - Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets [2.8123257987021058]
We focus on enhancing the LLaMA-2-Amharic model by integrating task-specific and generative datasets.
We compile an Amharic instruction fine-tuning dataset and fine-tuned LLaMA-2-Amharic model.
The fine-tuned model shows promising results in different NLP tasks.
arXiv Detail & Related papers (2024-02-12T19:25:11Z) - Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open
Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs)
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic [6.021269454707625]
We introduce two powerful deep bidirectional transformer-based models, ARBERT and MARBERT, that have superior performance to all existing models.
ArBench is built using 41 datasets targeting 5 different tasks/task clusters.
When fine-tuned on ArBench, ARBERT and MARBERT collectively achieve new SOTA with sizeable margins compared to all existing models.
arXiv Detail & Related papers (2020-12-27T06:32:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.