Related papers: CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning

URL: http://arxiv.org/abs/2509.00457v2
Date: Fri, 05 Sep 2025 20:27:56 GMT
Title: CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning
Authors: Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Hichem Telli, Cosimo Distante, Abdenour Hadid
Abstract summary: Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares. We present a framework for solving inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning.
Score: 6.5255476646093316
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Islamic inheritance law (Ilm al-Mawarith) requires precise identification of heirs and calculation of shares, which poses a challenge for AI. In this paper, we present a lightweight framework for solving multiple-choice inheritance questions using a specialised Arabic text encoder and Attentive Relevance Scoring (ARS). The system ranks answer options according to semantic relevance, and enables fast, on-device inference without generative reasoning. We evaluate Arabic encoders (MARBERT, ArabicBERT, AraBERT) and compare them with API-based LLMs (Gemini, DeepSeek) on the QIAS 2025 dataset. While large models achieve an accuracy of up to 87.6%, they require more resources and are context-dependent. Our MARBERT-based approach achieves 69.87% accuracy, presenting a compelling case for efficiency, on-device deployability, and privacy. While this is lower than the 87.6% achieved by the best-performing LLM, our work quantifies a critical trade-off between the peak performance of large models and the practical advantages of smaller, specialized systems in high-stakes domains.
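Illustrative sketch (not from the paper): the abstract says the system ranks answer options by semantic relevance to the question. The snippet below shows only that general idea, under loud assumptions. It uses plain cosine similarity as a stand-in and does not implement the paper's Attentive Relevance Scoring. The vectors are hand-written toy values rather than MARBERT outputs, and the names cosine_similarity and rank_options are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_options(
    question_vec: Sequence[float],
    option_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs sorted from most to least similar to the question."""
    scored = [
        (label, cosine_similarity(question_vec, vec))
        for label, vec in option_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumption): in a real system these would come from a text encoder.
question = [0.9, 0.1, 0.3]
options = {
    "A": [0.1, 0.9, 0.2],
    "B": [0.8, 0.2, 0.4],
    "C": [0.0, 0.0, 1.0],
}

ranking = rank_options(question, options)
predicted = ranking[0][0]  # label of the highest-scoring option
print(ranking, predicted)
```

Scoring each option against the question needs one encoder pass per text and no generation step, which is consistent with the efficiency argument in the abstract. How ARS actually computes relevance is not specified on this page.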

Related papers

ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning [0.0]
ALPS (Arabic Linguistic & Pragmatic Suite) is a native, expert-curated diagnostic challenge set probing Deep Semantics and Pragmatics. ALPS targets the depth of linguistic understanding through 531 rigorously crafted questions across 15 tasks and 47 subtasks. We developed the dataset with deep expertise in Arabic linguistics, guaranteeing cultural authenticity and eliminating translation artifacts.
arXiv Detail & Related papers (2026-02-19T03:51:37Z)
Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments [14.079091139464175]
This work introduces a formal mathematical definition of the Agent Skill process, followed by a systematic evaluation of language models of varying sizes. Results show that tiny models struggle with reliable skill selection, while moderately sized SLMs (approximately 12B - 30B) benefit substantially from the Agent Skill approach.
arXiv Detail & Related papers (2026-02-18T17:52:17Z)
ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding [49.67493845115009]
ELAIPBench is a benchmark curated by domain experts to evaluate large language models' comprehension of AI research papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance.
arXiv Detail & Related papers (2025-10-12T11:11:20Z)
Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions [1.1883838320818292]
Large language models (LLMs) in hiring promise to streamline candidate screening but also raise serious concerns regarding accuracy and algorithmic bias. We benchmark several state-of-the-art foundational LLMs and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. Our experiments show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups.
arXiv Detail & Related papers (2025-07-02T19:02:18Z)
Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective [3.2771631221674333]
We leverage task-specific data augmentations throughout the training, generation, and scoring phases. We employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set.
arXiv Detail & Related papers (2025-05-08T11:17:10Z)
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? [64.62421656031128]
MLRC-Bench is a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Unlike prior work, MLRC-Bench measures the key steps of proposing and implementing novel research methods. Even the best-performing tested agent closes only 9.3% of the gap between baseline and top human participant scores.
arXiv Detail & Related papers (2025-04-13T19:35:43Z)
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines [118.8024915014751]
Large language models (LLMs) have demonstrated remarkable proficiency in academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. We present SuperGPQA, a benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
arXiv Detail & Related papers (2025-02-20T17:05:58Z)
How well can LLMs Grade Essays in Arabic? [3.101490720236325]
This research assesses the effectiveness of large language models (LLMs) in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance.
arXiv Detail & Related papers (2025-01-27T21:30:02Z)
Can Large Language Models Predict the Outcome of Judicial Decisions? [0.0]
Large Language Models (LLMs) have shown exceptional capabilities in Natural Language Processing (NLP). We benchmark state-of-the-art open-source LLMs, including LLaMA-3.2-3B and LLaMA-3.1-8B, under varying configurations. Our results demonstrate that fine-tuned smaller models achieve comparable performance to larger models in task-specific contexts.
arXiv Detail & Related papers (2025-01-15T11:32:35Z)
Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs [70.54226917774933]
We propose the Decomposition-Alignment-Reasoning Agent (DARA) framework. DARA effectively parses questions into formal queries through a dual mechanism. We show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
arXiv Detail & Related papers (2024-06-11T09:09:37Z)
