Related papers: Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

URL: http://arxiv.org/abs/2510.24328v1
Date: Tue, 28 Oct 2025 11:52:51 GMT
Title: Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Authors: Hunzalah Hassan Bhatti, Firoj Alam
Abstract summary: Large Language Models (LLMs) are increasingly used to answer everyday questions. Their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects.
Score: 7.228273711234901
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.

Related papers

DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z)
FarsiMCQGen: a Persian Multiple-choice Question Generation Framework [2.026379197206863]
This paper introduces FarsiMCQGen, an innovative approach for generating Persian-language multiple-choice questions (MCQs). Our methodology combines candidate generation, filtering, and ranking techniques to build a model that generates answer choices resembling those in real MCQs. We leverage advanced methods, including Transformers and knowledge graphs, integrated with rule-based approaches to craft credible distractors that challenge test-takers.
arXiv Detail & Related papers (2025-10-16T20:52:07Z)
A method for improving multilingual quality and diversity of instruction fine-tuning datasets [29.07537849245622]
We introduce Multilingual Data Quality and Diversity (M-DaQ) to improve Multilingual Instruction Fine-Tuning (IFT). M-DaQ is a novel method for improving LLMs' multilinguality by selecting high-quality and semantically diverse multilingual IFT samples. Empirical results across 18 languages demonstrate that models fine-tuned with M-DaQ achieve significant performance gains over vanilla baselines, with a win rate above 60%.
arXiv Detail & Related papers (2025-09-19T03:07:59Z)
HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark [54.73504952691398]
We set out to deliver a Hebrew Machine Reading dataset as extractive Questioning. The morphologically rich nature of Hebrew poses a challenge to this endeavor. We devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics.
arXiv Detail & Related papers (2025-08-03T15:53:01Z)
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. First-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a dataset of 51.7K culturally specific questions across 23 different languages. We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z)
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures. CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions. We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic. We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.