Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
- URL: http://arxiv.org/abs/2508.13525v1
- Date: Tue, 19 Aug 2025 05:33:48 GMT
- Title: Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
- Authors: Hassan Barmandah
- Abstract summary: Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA). This underrepresentation hinders their ability to capture authentic dialectal variation. We use a privately curated Saudi Dialect Instruction dataset to LoRA-tune ALLaM-7B-Instruct-preview for Saudi dialect generation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
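The two training variants described in the abstract differ only in whether an explicit dialect tag is prepended to the instruction at formatting time. A minimal sketch of that formatting step follows; the tag strings and the instruction template are illustrative assumptions, not the authors' released code.

```python
from typing import Optional

# Hypothetical dialect tags; the paper does not specify the exact token text.
DIALECT_TAGS = {"hijazi": "[HIJAZI]", "najdi": "[NAJDI]"}

def format_example(instruction: str, response: str,
                   dialect: Optional[str] = None) -> str:
    """Build one training string.

    Dialect-Token variant: pass a dialect name, and its tag is prepended
    to the instruction. No-Token variant: omit `dialect`, and the same
    template is used without any tag.
    """
    prefix = f"{DIALECT_TAGS[dialect]} " if dialect else ""
    return (f"### Instruction:\n{prefix}{instruction}\n"
            f"### Response:\n{response}")

# Dialect-Token formatting: instruction carries an explicit [NAJDI] tag.
with_tag = format_example("اكتب تحية قصيرة", "هلا والله", dialect="najdi")

# No-Token formatting: identical template, no tag.
no_tag = format_example("اكتب تحية قصيرة", "هلا والله")
```

At inference time, the Dialect-Token model is prompted with the same tag to steer generation toward the requested dialect, which is what the reported Saudi-rate gain (47.97% to 84.21%) measures.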
Related papers
- Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition [0.0]
This paper presents a study of data augmentation techniques for fine-tuning OpenAI Whisper models. It establishes the first benchmark for the Sudanese dialect.
arXiv Detail & Related papers (2026-01-11T08:28:31Z) - DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects, and design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z) - Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale [51.41777906371754]
We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. A lightweight language model, LFM2-1.2B, is fine-tuned on this data and used to translate high-quality English instruction sets into Arabic. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths.
arXiv Detail & Related papers (2025-09-17T14:19:28Z) - UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat [1.2788586581322734]
The Saudi Data and AI Authority introduced the ALLaM family of Arabic-focused models. The most capable of these available to the public, ALLaM-34B, was adopted by HUMAIN, who developed and deployed HUMAIN Chat. This paper presents an expanded and refined UI-level evaluation of ALLaM-34B.
arXiv Detail & Related papers (2025-08-24T14:32:15Z) - Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis [0.0]
Arabic is among the most widely spoken languages in the world, with a huge variety of dialects spoken across 22 countries. In this study, we address the problem of classifying 18 Arabic dialects in the QADI dataset of Arabic tweets. Among the evaluated models, MARBERTv2 performed best, with 65% accuracy and a 64% F1-score.
arXiv Detail & Related papers (2025-06-24T16:06:58Z) - Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [70.23624194206171]
This paper addresses the need to democratize large language models (LLMs) in the Arab world. One practical objective for an Arabic LLM is an Arabic-specific tokenizer vocabulary that could speed up decoding. Inspired by vocabulary learning during second-language (Arabic) acquisition in humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z) - AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training on Arabic texts and Supervised Fine-Tuning (SFT) using native Arabic instructions paired with GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.