ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
- URL: http://arxiv.org/abs/2402.12840v1
- Date: Tue, 20 Feb 2024 09:07:41 GMT
- Title: ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
- Authors: Fajri Koto and Haonan Li and Sara Shatnawi and Jad Doughman and
Abdelrahman Boda Sadallah and Aisha Alraeesi and Khalid Almubarak and Zaid
Alyafeai and Neha Sengupta and Shady Shehata and Nizar Habash and Preslav
Nakov and Timothy Baldwin
- Abstract summary: We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA).
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
- Score: 53.1913348687902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The focus of language model evaluation has transitioned towards reasoning and
knowledge-intensive tasks, driven by advancements in pretraining large models.
While state-of-the-art models are partially trained on large Arabic texts,
evaluating their performance in Arabic remains challenging due to the limited
availability of relevant datasets. To bridge this gap, we present ArabicMMLU,
the first multi-task language understanding benchmark for the Arabic language,
sourced from school exams across diverse educational levels in different
countries spanning North Africa, the Levant, and the Gulf regions. Our data
comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard
Arabic (MSA), and is carefully constructed by collaborating with native
speakers in the region. Our comprehensive evaluations of 35 models reveal
substantial room for improvement, particularly among the best open-source
models. Notably, BLOOMZ, mT0, LLama2, and Falcon struggle to achieve a score of
50%, while even the top-performing Arabic-centric model only achieves a score
of 62.3%.
Related papers
- AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z)
- The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic [0.0]
We introduce two novel benchmarks designed to evaluate models' mathematical reasoning and language understanding abilities in Arabic.
These benchmarks are derived from a General Aptitude Test (GAT) called Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia.
arXiv Detail & Related papers (2024-06-28T16:34:31Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data by far is the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- LAraBench: Benchmarking Arabic AI with Large Language Models [26.249084464525044]
LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks.
We utilize models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM to tackle 33 distinct tasks across 61 publicly available datasets.
This involved 98 experimental setups, encompassing 296K data points, 46 hours of speech, and 30 sentences for Text-to-Speech (TTS).
arXiv Detail & Related papers (2023-05-24T10:16:16Z)
- ORCA: A Challenging Benchmark for Arabic Language Understanding [8.9379057739817]
ORCA is a publicly available benchmark for Arabic language understanding evaluation.
To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models.
arXiv Detail & Related papers (2022-12-21T04:35:43Z)
- Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding [44.048072667378115]
Existing Arabic PLMs are not well explored, and their pre-training can be improved significantly.
There is a lack of systematic and reproducible evaluation of these models in the literature.
We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks.
arXiv Detail & Related papers (2022-05-21T22:38:19Z)