ArabianGPT: Native Arabic GPT-based Large Language Model
- URL: http://arxiv.org/abs/2402.15313v2
- Date: Mon, 26 Feb 2024 09:54:47 GMT
- Title: ArabianGPT: Native Arabic GPT-based Large Language Model
- Authors: Anis Koubaa, Adel Ammar, Lahouari Ghouti, Omar Najar, Serry Sibaee
- Abstract summary: This paper proposes ArabianGPT, a series of transformer-based models within the ArabianLLM suite designed explicitly for Arabic.
The AraNizer tokenizer, integral to these models, addresses the unique morphological aspects of Arabic script.
For sentiment analysis, the fine-tuned ArabianGPT-0.1B model achieved a remarkable accuracy of 95%, a substantial increase from the base model's 56%.
- Score: 2.8623940003518156
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The predominance of English and Latin-based large language models (LLMs) has
led to a notable deficit in native Arabic LLMs. This discrepancy is accentuated
by the prevalent inclusion of English tokens in existing Arabic models,
detracting from their efficacy in processing native Arabic's intricate
morphology and syntax. Consequently, there is a theoretical and practical
imperative for developing LLMs predominantly focused on Arabic linguistic
elements. To address this gap, this paper proposes ArabianGPT, a series of
transformer-based models within the ArabianLLM suite designed explicitly for
Arabic. These models, including ArabianGPT-0.1B and ArabianGPT-0.3B, vary in
size and complexity, aligning with the nuanced linguistic characteristics of
Arabic. The AraNizer tokenizer, integral to these models, addresses the unique
morphological aspects of Arabic script, ensuring more accurate text processing.
Empirical results from fine-tuning the models on tasks like sentiment analysis
and summarization demonstrate significant improvements. For sentiment analysis,
the fine-tuned ArabianGPT-0.1B model achieved a remarkable accuracy of 95%, a
substantial increase from the base model's 56%. Similarly, in summarization
tasks, fine-tuned models showed enhanced F1 scores, indicating improved
precision and recall in generating concise summaries. Comparative analysis of
fine-tuned ArabianGPT models against their base versions across various
benchmarks reveals nuanced differences in performance, with fine-tuning
positively impacting specific tasks like question answering and summarization.
These findings underscore the efficacy of fine-tuning in aligning ArabianGPT
models more closely with specific NLP tasks, highlighting the potential of
tailored transformer architectures in advancing Arabic NLP.
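To make the fine-tuning recipe above concrete, here is a minimal sketch of adapting a small GPT-style Arabic checkpoint to binary sentiment classification with the Hugging Face Trainer. The checkpoint identifier, data files, label count, and hyperparameters are placeholders for illustration only; the paper does not publish this exact script.

```python
# Minimal sketch: fine-tuning a GPT-style Arabic checkpoint for sentiment analysis.
# "org/ArabianGPT-0.1B" and the CSV files are hypothetical placeholders.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "org/ArabianGPT-0.1B"  # placeholder, not a confirmed model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token          # GPT-style models ship no pad token
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Any labelled Arabic sentiment corpus with "text"/"label" columns would work here.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arabiangpt-sentiment",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,   # enables dynamic padding via DataCollatorWithPadding
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())
```

For the summarization results, the analogous setup would fine-tune the same checkpoint as a causal language model on article/summary pairs and score the generations with ROUGE-style F1 rather than classification accuracy.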
Related papers
- Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning [0.6752538702870792]
This work presents a novel framework for training Arabic nested embedding models through Matryoshka Embedding Learning.
Our innovative contribution includes the translation of various sentence similarity datasets into Arabic.
We trained several embedding models on the Arabic Natural Language Inference triplet dataset and assessed their performance.
arXiv Detail & Related papers (2024-07-30T19:03:03Z)
- ALLaM: Large Language Models for Arabic and English [9.881560166505452]
We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT).
Our autoregressive decoder-only models demonstrate how second-language acquisition via vocabulary expansion and pretraining can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English); a generic sketch of this vocabulary-expansion step appears after this list.
We show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment.
arXiv Detail & Related papers (2024-07-22T05:35:17Z)
- GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning [0.0]
We introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content.
We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality.
Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks.
arXiv Detail & Related papers (2024-07-02T10:43:49Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- Arabic Sentiment Analysis with Noisy Deep Explainable Model [48.22321420680046]
This paper proposes an explainable sentiment classification framework for the Arabic language.
The proposed framework can explain specific predictions by training a local surrogate explainable model.
We carried out experiments on public benchmark Arabic SA datasets.
arXiv Detail & Related papers (2023-09-24T19:26:53Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- ChatGPT for Arabic Grammatical Error Correction [5.945320097465418]
Large language models (LLMs) fine-tuned to follow human instruction have exhibited significant capabilities in English NLP tasks.
In this paper, we delve into the abilities of instruction fine-tuned LLMs in Arabic GEC, a task made complex by Arabic's rich morphology.
We find that instruction fine-tuned models, regardless of their size, significantly underperform fully fine-tuned models of much smaller size.
arXiv Detail & Related papers (2023-08-08T18:00:39Z)
- Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model.
We find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns.
We demonstrate that a very simple strategy, randomly masking 20% of non-error tokens from the input sequence during fine-tuning, is sufficient to learn a much better language model without sacrificing the error model.
arXiv Detail & Related papers (2023-05-28T13:19:12Z)
- A Sequence-to-Sequence Approach for Arabic Pronoun Resolution [0.0]
This paper proposes a sequence-to-sequence learning approach for Arabic pronoun resolution.
The proposed approach is evaluated on the AnATAr dataset.
arXiv Detail & Related papers (2023-05-19T08:53:41Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
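The second-language-acquisition recipe summarised in the ALLaM entry above is, at its core, vocabulary expansion followed by continued pretraining. The sketch below shows the generic mechanics with the Hugging Face API; the base model ("gpt2") and the handful of Arabic tokens are stand-ins chosen only to keep the example self-contained, not the authors' actual configuration.

```python
# Generic sketch of vocabulary expansion before continued pretraining on Arabic.
# "gpt2" and the token list are illustrative stand-ins, not ALLaM's real setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "gpt2"  # placeholder English-centric base model
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID)

# In practice these tokens would come from a BPE/SentencePiece model trained on a
# large Arabic corpus; a few literal tokens are listed here to keep this runnable.
arabic_tokens = ["السلام", "عليكم", "اللغة", "العربية"]
num_added = tokenizer.add_tokens(arabic_tokens)

# Give the new tokens trainable embedding rows, then continue causal-LM
# pretraining on Arabic text so the model acquires the new language.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```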