A Survey of Large Language Models for Arabic Language and its Dialects
- URL: http://arxiv.org/abs/2410.20238v1
- Date: Sat, 26 Oct 2024 17:48:20 GMT
- Title: A Survey of Large Language Models for Arabic Language and its Dialects
- Authors: Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa
- Abstract summary: This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects.
It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training.
The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks.
- Abstract: This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and emphasizes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.
Related papers
- AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.
It is the first fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
We will release the dialectal translation models and benchmarks curated in this study.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
- ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models [0.0]
ArabLegalEval is a benchmark dataset for assessing the Arabic legal knowledge of Large Language Models (LLMs).
Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions.
We aim to analyze the capabilities required to solve legal problems in Arabic and benchmark the performance of state-of-the-art LLMs.
arXiv Detail & Related papers (2024-08-15T07:09:51Z)
- 101 Billion Arabic Words Dataset [0.0]
This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic Language Models.
We undertook a large-scale data mining project, extracting a substantial volume of text from the Common Crawl WET files.
The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset.
arXiv Detail & Related papers (2024-04-29T13:15:03Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), for extractive question answering (EQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- ArabicaQA: A Comprehensive Dataset for Arabic Question Answering [13.65056111661002]
We introduce ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic.
We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus.
arXiv Detail & Related papers (2024-03-26T16:37:54Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)