UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
- URL: http://arxiv.org/abs/2508.17378v1
- Date: Sun, 24 Aug 2025 14:32:15 GMT
- Title: UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
- Authors: Omer Nacar
- Abstract summary: Saudi Data and AI Authority introduced the $ALLaM$ family of Arabic-focused models. The most capable of these available to the public, $ALLaM-34B$, was adopted by HUMAIN, who developed and deployed HUMAIN Chat. This paper presents an expanded and refined UI-level evaluation of $ALLaM-34B$.
- Score: 1.2788586581322734
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the $ALLaM$ family of Arabic-focused models. The most capable of these available to the public, $ALLaM-34B$, was subsequently adopted by HUMAIN, who developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of $ALLaM-34B$. Using a prompt pack spanning Modern Standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts × 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95\% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance (4.54/5). Taken together, these results position $ALLaM-34B$ as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
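The category-level means with 95% confidence intervals described in the abstract can be sketched as follows. This is a minimal illustration using a normal approximation over hypothetical judge scores; the paper does not specify its exact interval construction, and the scores below are invented for demonstration:

```python
import math
import statistics

def mean_ci95(scores):
    """Mean and normal-approximation 95% confidence interval for a list of scores."""
    n = len(scores)
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    half = 1.96 * se  # normal approximation; a t-quantile is more exact for small n
    return m, m - half, m + half

# Hypothetical 1-5 judge scores for one category (5 runs x 3 judges = 15 scores)
scores = [5, 5, 4, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5]
m, lo, hi = mean_ci95(scores)
print(f"mean={m:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

With 15 scores per category (23 prompts grouped into categories, 5 runs each, 3 judges), intervals of this kind convey how stable each category mean is across runs and judges.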
Related papers
- AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects [10.336992954849213]
The Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.
arXiv Detail & Related papers (2025-11-17T13:06:55Z)
- DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z)
- DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects. We design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z)
- Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation [0.0]
Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA). This underrepresentation hinders their ability to capture authentic dialectal variation. We use a privately curated Saudi Dialect Instruction dataset to develop a foundation model for Saudi dialect generation.
arXiv Detail & Related papers (2025-08-19T05:33:48Z)
- BALSAM: A Platform for Benchmarking Arabic Large Language Models [34.50348949235453]
BALSAM is a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation.
arXiv Detail & Related papers (2025-07-30T12:16:39Z)
- Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation [15.773183176935405]
This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects. Few-shot prompting consistently outperformed zero-shot, chain-of-thought, and the proposed Ara-TEaR method. Joint multi-dialect trained models outperformed single-dialect counterparts by over 10% CHrF++.
arXiv Detail & Related papers (2025-07-27T14:37:53Z)
- AIN: The Arabic INclusive Large Multimodal Model [71.29419186696138]
AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools.
arXiv Detail & Related papers (2025-01-31T18:58:20Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [70.23624194206171]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by the vocabulary learning that occurs during second language (Arabic) acquisition in humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.