AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2511.14295v1
- Date: Tue, 18 Nov 2025 09:47:01 GMT
- Title: AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
- Authors: Mohammad Zbib, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem
- Abstract summary: AraLingBench is a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions.
- Score: 37.79823471716066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.
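Since the abstract frames AraLingBench as a set of category-tagged multiple-choice questions, the sketch below shows one plausible way to compute per-category accuracy for an evaluated model. The item schema, field names, and the `predict` stub are illustrative assumptions only; the authors' actual evaluation harness is the GitHub release mentioned in the abstract and may be organized differently.

```python
from collections import defaultdict

# Minimal sketch of per-category multiple-choice scoring, assuming a
# hypothetical item schema (category, question, options, answer). The real
# AraLingBench data format and evaluation code may differ.
items = [
    {
        "category": "grammar",  # one of the five categories named in the abstract
        "question": "Example question text",
        "options": ["A", "B", "C", "D"],
        "answer": "B",  # gold option letter
    },
    # ... the full benchmark has 150 expert-designed questions
]


def predict(question: str, options: list) -> str:
    """Placeholder for querying an LLM; returns one option letter."""
    return "A"  # replace with a real model call


def score_by_category(items):
    """Return accuracy per linguistic category (grammar, morphology, ...)."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        if predict(item["question"], item["options"]) == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    print(score_by_category(items))
```

Reporting accuracy per category rather than a single aggregate score is what lets a benchmark like this separate surface-level proficiency (e.g., spelling) from deeper grammatical and syntactic reasoning.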
Related papers
- DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models [4.615257892219717]
We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage.
arXiv Detail & Related papers (2025-10-19T16:55:20Z) - LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs [0.631976908971572]
LingBench++ is a benchmark and reasoning framework for evaluating large language models (LLMs). It provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 languages. We show that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability.
arXiv Detail & Related papers (2025-07-22T17:57:44Z) - BnMMLU: Measuring Massive Multitask Language Understanding in Bengali [0.0]
We introduce BnMMLU, a benchmark to evaluate the Bengali language understanding capabilities of language models. The dataset spans 23 domains, including science, humanities, mathematics and general knowledge. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set.
arXiv Detail & Related papers (2025-05-25T02:54:31Z) - ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts [8.181151553582488]
ScholarBench is a benchmark for evaluating the academic reasoning ability of large language models (LLMs). The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543.
arXiv Detail & Related papers (2025-05-22T11:59:06Z) - PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [85.78821098963607]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for evaluating the adaptability of knowledge editing methods across five languages. MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice formats. We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z) - Pragmatic Competence Evaluation of Large Language Models for the Korean Language [0.6757476692230009]
This study evaluates how well Large Language Models (LLMs) understand context-dependent expressions from a pragmatic standpoint, specifically in Korean.
We use both Multiple-Choice Questions (MCQs) for automatic evaluation and Open-Ended Questions (OEQs) assessed by human experts.
arXiv Detail & Related papers (2024-03-19T12:21:20Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)