AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
- URL: http://arxiv.org/abs/2506.08768v2
- Date: Wed, 11 Jun 2025 09:00:02 GMT
- Title: AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
- Authors: Ahmed Hasanaath, Aisha Alansari, Ahmed Ashraf, Chafik Salmane, Hamzah Luqman, Saad Ezzini
- Abstract summary: Large language models (LLMs) have shown remarkable progress in reasoning abilities. Yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs.
- Score: 2.869780207429188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning, allowing us to systematically evaluate performance on datasets covering a range of applications and to examine the models' capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299
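To make the two headline techniques concrete, here is a minimal sketch of a 3-shot in-context setup for Arabic sentiment classification. The example sentences, labels, and prompt wording are illustrative assumptions; the paper's released prompts and datasets may differ.

```python
# Minimal 3-shot prompt builder for Arabic sentiment classification.
# The examples and prompt template are hypothetical, not the paper's own.

FEW_SHOT_EXAMPLES = [
    ("الخدمة ممتازة والموظفون متعاونون", "positive"),  # "Excellent service, helpful staff"
    ("المنتج وصل متأخرا ومكسورا", "negative"),        # "The product arrived late and broken"
    ("الفيلم كان عاديا، لا جيد ولا سيئ", "neutral"),   # "The movie was average, neither good nor bad"
]

def build_prompt(text: str) -> str:
    """Assemble a 3-shot classification prompt from labeled examples."""
    lines = [
        "Classify the sentiment of the Arabic sentence as positive, negative, or neutral.",
        "",
    ]
    for sentence, label in FEW_SHOT_EXAMPLES:
        lines += [f"Sentence: {sentence}", f"Sentiment: {label}", ""]
    lines += [f"Sentence: {text}", "Sentiment:"]
    return "\n".join(lines)

print(build_prompt("التطبيق سهل الاستخدام"))  # "The app is easy to use"
```

The LoRA fine-tuning setup can likewise be sketched with Hugging Face PEFT. The checkpoint name, target modules, and hyperparameters below are assumptions for illustration; the paper's exact configuration is not given here.

```python
# Hypothetical LoRA setup with Hugging Face PEFT: only small low-rank
# adapter matrices are trained while the base model stays frozen.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = "deepseek-ai/deepseek-llm-7b-base"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # adapter scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a small fraction of weights
```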
Related papers
- Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models [0.0]
Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally. This study introduces a benchmark designed to quantify model-internal reasoning in different domains.
arXiv Detail & Related papers (2025-04-14T18:15:27Z)
- Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction [2.2999148299770047]
This study explores the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task. We report F1 scores almost on par with those obtained with state-of-the-art fine-tuned models, exceeding previously reported zero- and few-shot performance.
arXiv Detail & Related papers (2025-02-18T16:56:15Z)
- SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment [78.4550589538805]
We propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, tunes only the feed-forward sub-layers of six layers, comprising 6.5-8% of all parameters within 7B and 13B LLMs.
arXiv Detail & Related papers (2025-01-07T10:29:43Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study [4.80612909282198]
This study introduces a new multi-task spatial evaluation dataset designed to explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers.
arXiv Detail & Related papers (2024-08-26T17:25:16Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- LAraBench: Benchmarking Arabic AI with Large Language Models [26.249084464525044]
LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks.
We utilize models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM to tackle 33 distinct tasks across 61 publicly available datasets.
This involved 98 experimental setups, encompassing 296K data points, 46 hours of speech, and 30 sentences for Text-to-Speech (TTS).
arXiv Detail & Related papers (2023-05-24T10:16:16Z)
- Text Classification via Large Language Models [63.1874290788797]
We introduce Clue And Reasoning Prompting (CARP) to address complex linguistic phenomena involved in text classification.
Remarkably, CARP yields new SOTA performance on 4 out of 5 widely used text-classification benchmarks.
More importantly, we find that CARP delivers impressive abilities on low-resource and domain-adaptation setups.
arXiv Detail & Related papers (2023-05-15T06:24:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.