MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
- URL: http://arxiv.org/abs/2508.16357v1
- Date: Fri, 22 Aug 2025 13:04:43 GMT
- Title: MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
- Authors: Adil Bahaj, Mounir Ghogho
- Abstract summary: This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic), a benchmark to evaluate large language models (LLMs) on Moroccan legal question answering. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps.
- Score: 13.01152821327721
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains, such as Arabic legal contexts, remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.
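For context, below is a minimal sketch of how exact-match scoring might work for MizanQA-style multi-answer multiple-choice items. The record layout, field names, and the `predict` callback are hypothetical illustrations under stated assumptions, not the authors' released code or schema.

```python
# Minimal sketch: exact-match accuracy for multi-answer multiple-choice QA,
# in the spirit of MizanQA's format. All field names and the toy predictor
# below are hypothetical; the released dataset/schema may differ.
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set

@dataclass
class MCQItem:
    question: str
    choices: Dict[str, str]        # e.g. {"A": "...", "B": "...", "C": "..."}
    gold: FrozenSet[str]           # one or more correct choice labels

def exact_match(predicted: Set[str], gold: FrozenSet[str]) -> bool:
    # Multi-answer items count as correct only when the predicted label set
    # equals the gold set exactly (no partial credit).
    return frozenset(predicted) == gold

def accuracy(items: List[MCQItem],
             predict: Callable[[MCQItem], Set[str]]) -> float:
    # `predict` would wrap an LLM call plus answer parsing; both are out of
    # scope for this sketch.
    correct = sum(exact_match(predict(it), it.gold) for it in items)
    return correct / len(items)

# Toy usage with a dummy predictor that always answers {"B"}:
item = MCQItem(
    question="Which code governs commercial disputes in Morocco?",
    choices={"A": "...", "B": "...", "C": "..."},
    gold=frozenset({"B"}),
)
print(accuracy([item], lambda it: {"B"}))  # 1.0
```

Exact set matching is one natural choice for multi-answer items, since partial-credit schemes can inflate scores; the paper's call for tailored evaluation metrics suggests the released benchmark may score differently.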
Related papers
- DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects.
We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z)
- Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning [5.902003660308139]
This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks.
We present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking.
arXiv Detail & Related papers (2025-09-26T15:19:12Z)
- HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark [54.73504952691398]
We set out to deliver a Hebrew machine reading comprehension dataset in the form of extractive question answering.
The morphologically rich nature of Hebrew poses a challenge to this endeavor.
We devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics.
arXiv Detail & Related papers (2025-08-03T15:53:01Z)
- MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on the English MMLU-Pro benchmark.
Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons.
To meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
- Can Large Language Models Predict the Outcome of Judicial Decisions? [0.0]
Large Language Models (LLMs) have shown exceptional capabilities in Natural Language Processing (NLP).
We benchmark state-of-the-art open-source LLMs, including LLaMA-3.2-3B and LLaMA-3.1-8B, under varying configurations.
Our results demonstrate that fine-tuned smaller models achieve comparable performance to larger models in task-specific contexts.
arXiv Detail & Related papers (2025-01-15T11:32:35Z)
- All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.
It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages.
The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z)
- Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE (African American Vernacular English).
We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families.
Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
- AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.
It is the first fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
- ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models [0.0]
ArabLegalEval is a benchmark dataset for assessing the Arabic legal knowledge of Large Language Models (LLMs).
Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions.
We aim to analyze the capabilities required to solve legal problems in Arabic and benchmark the performance of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-08-15T07:09:51Z)
- Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks [29.819766942335416]
Multimodal large language models (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension.
We introduce a comprehensive family of Arabic MLLMs, dubbed Peacock, with strong vision and language capabilities.
arXiv Detail & Related papers (2024-03-01T23:38:02Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support [18.810320088441678]
This work introduces a novel NLP benchmark for the legal domain.
It challenges LLMs in five key dimensions, including processing long documents (up to 50K tokens), using domain-specific knowledge (embodied in legal texts), and multilingual understanding (covering five languages).
Our benchmark contains diverse datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual legal system.
arXiv Detail & Related papers (2023-06-15T16:19:15Z)