Related papers: Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM

URL: http://arxiv.org/abs/2510.13481v2
Date: Sun, 26 Oct 2025 13:05:20 GMT
Title: Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
Authors: Areej AlOtaibi, Lina Alyahya, Raghad Alshabanah, Shahad Alfawzan, Shuruq Alarefei, Reem Alsabti, Nouf Alsubaie, Abdulaziz Alhuzaymi, Lujain Alkhelb, Majd Alsayari, Waad Alahmed, Omar Talabay, Jalal Alowibdi, Salem Alelyani, Adel Bibi
Abstract summary: Large Language Models (LLMs) have significantly advanced the field of natural language processing. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation.
Score: 13.961748369867777
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.

Related papers

Dhati+: Fine-tuned Large Language Models for Arabic Subjectivity Evaluation [0.0]
Despite its significance, Arabic faces the challenge of being under-resourced. The scarcity of large annotated datasets hampers the development of accurate tools for subjectivity analysis in Arabic. Recent advances in deep learning and Transformers have proven highly effective for text classification in English and French.
arXiv Detail & Related papers (2025-08-27T15:20:12Z)
Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations [3.4379069363635626]
This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution.
arXiv Detail & Related papers (2025-07-19T16:30:45Z)
Improving Multilingual Math Reasoning for African Languages [49.27985213689457]
We conduct experiments to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focus on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.
arXiv Detail & Related papers (2025-05-26T11:35:01Z)
Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking [12.078532717928185]
Large language models (LLMs) continue to exhibit biases toward Western, Anglo-centric, or American cultures. We introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs. We find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations.
arXiv Detail & Related papers (2025-02-28T22:28:00Z)
How well can LLMs Grade Essays in Arabic? [3.101490720236325]
This research assesses the effectiveness of large language models (LLMs) in the task of Arabic automated essay scoring (AES) using the AR-AES dataset. It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning. A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance.
arXiv Detail & Related papers (2025-01-27T21:30:02Z)
ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models [0.0]
ArabLegalEval is a benchmark dataset for assessing the Arabic legal knowledge of Large Language Models (LLMs). Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions. We aim to analyze the capabilities required to solve legal problems in Arabic and benchmark the performance of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-08-15T07:09:51Z)
A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [51.8203871494146]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing. Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient. This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z)
Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
History, Development, and Principles of Large Language Models-An Introductory Survey [15.875687167037206]
Language models serve as a cornerstone in natural language processing (NLP). Over extensive research spanning decades, language modeling has progressed from initial statistical language models (SLMs) to the contemporary landscape of large language models (LLMs).
arXiv Detail & Related papers (2024-02-10T01:18:15Z)
Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.