JABER: Junior Arabic BERt
- URL: http://arxiv.org/abs/2112.04329v1
- Date: Wed, 8 Dec 2021 15:19:24 GMT
- Title: JABER: Junior Arabic BERt
- Authors: Abbas Ghaddar, Yimeng Wu, Ahmad Rashid, Khalil Bibi, Mehdi
Rezagholizadeh, Chao Xing, Yasheng Wang, Duan Xinyu, Zhefeng Wang, Baoxing
Huai, Xin Jiang, Qun Liu and Philippe Langlais
- Abstract summary: We present JABER, Junior Arabic BERt, our pretrained language model prototype dedicated to Arabic.
We conduct an empirical study to systematically evaluate the performance of models across a diverse set of existing Arabic NLU tasks.
- Score: 37.174723137868675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-specific pre-trained models have proven to be more accurate than
multilingual ones in a monolingual evaluation setting; Arabic is no exception.
However, we found that previously released Arabic BERT models were
significantly under-trained. In this technical report, we present JABER, Junior
Arabic BERt, our pretrained language model prototype dedicated to Arabic. We
conduct an empirical study to systematically evaluate the performance of models
across a diverse set of existing Arabic NLU tasks. Experimental results show
that JABER achieves state-of-the-art performance on ALUE, a new benchmark
for Arabic Language Understanding Evaluation, as well as on a well-established
NER benchmark.
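As a rough illustration of the evaluation setup described in the abstract, the sketch below fine-tunes a BERT-style Arabic encoder on a toy sentence-classification task with the Hugging Face transformers API. The checkpoint identifier and the example sentences are hypothetical placeholders, not the authors' released model or the ALUE data.

```python
# Minimal sketch: fine-tuning a BERT-style Arabic encoder on a toy
# sentence-classification task. The checkpoint name and examples are
# hypothetical placeholders, not the authors' exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "some-org/arabic-bert-base"  # hypothetical checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy Arabic sentences standing in for an ALUE-style classification task.
texts = ["هذا الفيلم رائع جدا", "لم يعجبني هذا المنتج"]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Single training step; a real fine-tuning run loops over the full task data.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# Inference on the same toy batch.
model.eval()
with torch.no_grad():
    logits = model(**batch).logits
print("predicted labels:", logits.argmax(dim=-1).tolist())
```

In an actual ALUE or NER evaluation, the toy batch would be replaced by the benchmark's official training and test splits and scored with the benchmark's own metrics; NER would use a token-classification head instead of a sequence-classification one.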
Related papers
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data by far is the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic [1.0878040851638]
We employ GPT-4 as a uniform evaluator for both English and Arabic queries to assess and compare the performance of the LLMs on various open-ended tasks.
We find that fine-tuned base models using multilingual and multi-turn datasets could be competitive to models trained from scratch on multilingual data.
arXiv Detail & Related papers (2023-10-23T11:40:04Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z)
- JASMINE: Arabic GPT Models for Few-Shot Learning [20.311937206016445]
We release a suite of powerful Arabic autoregressive Transformer language models ranging in size from 300 million to 6.7 billion parameters, pretrained on a large and diverse dataset (235 GB of text).
We also carefully design and release a comprehensive benchmark for both automated and human evaluation of Arabic autoregressive models, with coverage of potential social biases, harms, and toxicity.
arXiv Detail & Related papers (2022-12-21T04:21:46Z)
- Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding [44.048072667378115]
Existing Arabic PLMs are not well-explored and their pre-training can be improved significantly.
There is a lack of systematic and reproducible evaluation of these models in the literature.
We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks.
arXiv Detail & Related papers (2022-05-21T22:38:19Z)
- AraBERT: Transformer-based Model for Arabic Language Understanding [0.0]
We pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language.
The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks.
arXiv Detail & Related papers (2020-02-28T22:59:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.