PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture
- URL: http://arxiv.org/abs/2509.02550v1
- Date: Tue, 02 Sep 2025 17:48:51 GMT
- Title: PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture
- Authors: Fakhraddin Alwajih, Abdellah El Mekki, Hamdy Mubarak, Majd Hawasly, Abubakr Mohamed, Muhammad Abdul-Mageed
- Abstract summary: PalmX 2025 is the first shared task designed to benchmark the cultural competence of Large Language Models (LLMs) in Arabic and Islamic cultures. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. The top-performing teams achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, there is a high chance it will be skewed towards high-resourced languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced with increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent.
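Evaluation in both subtasks reduces to exact-match accuracy over multiple-choice answers. As a minimal sketch (the option labels and data layout here are illustrative assumptions, not the official PalmX schema or scoring script), scoring looks like:

```python
# Minimal sketch of MCQ accuracy scoring for a PalmX-style benchmark.
# The option labels and data format are illustrative assumptions,
# not the official PalmX schema.

def score_mcq(predictions, gold_answers):
    """Return accuracy for multiple-choice predictions.

    predictions and gold_answers are parallel lists of option labels
    (e.g. "A", "B", "C", "D").
    """
    if len(predictions) != len(gold_answers):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Example: a system answering 3 of 4 questions correctly scores 0.75.
preds = ["A", "C", "B", "D"]
gold = ["A", "C", "B", "A"]
print(f"accuracy = {score_mcq(preds, gold):.2%}")  # accuracy = 75.00%
```

Reported leaderboard numbers such as 72.15% and 84.22% are this ratio expressed as a percentage over each subtask's test set.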
Related papers
- ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
ALPS (Arabic Linguistic & Pragmatic Suite) is a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics. ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts.
arXiv Detail & Related papers (2026-02-19T03:51:37Z) - OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education
We introduce OmniEduBench, a comprehensive Chinese educational benchmark. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension. The dataset features a rich variety of question formats, including 11 common exam question types.
arXiv Detail & Related papers (2025-10-30T12:16:29Z) - BALSAM: A Platform for Benchmarking Arabic Large Language Models
BALSAM is a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation.
arXiv Detail & Related papers (2025-07-30T12:16:39Z) - Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations.
arXiv Detail & Related papers (2025-02-28T19:59:13Z) - AIN: The Arabic INclusive Large Multimodal Model
AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools.
arXiv Detail & Related papers (2025-01-31T18:58:20Z) - Arabic Dataset for LLM Safeguard Evaluation
This study explores the safety of large language models (LLMs) in Arabic with its linguistic and cultural complexities. We present an Arab-region-specific safety evaluation dataset consisting of 5,799 questions, including direct attacks, indirect attacks, and harmless requests with sensitive words.
arXiv Detail & Related papers (2024-10-22T14:12:43Z) - LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content
Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields. This study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 23 testing sets, and achieves comparable performance on 8 sets.
arXiv Detail & Related papers (2024-10-20T06:37:37Z) - CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark by Human-AI CulturalTeaming
CulturalBench is a set of 1,696 human-written and human-verified questions to assess LMs' cultural knowledge. It covers 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. We construct CulturalBench using methods inspired by Human-AI Red-Teaming.
arXiv Detail & Related papers (2024-10-03T17:04:31Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - AceGPT, Localizing Large Language Models in Arabic
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.