ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic
- URL: http://arxiv.org/abs/2101.01785v1
- Date: Sun, 27 Dec 2020 06:32:55 GMT
- Title: ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic
- Authors: Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi
- Abstract summary: We introduce two powerful deep bidirectional transformer-based models, ARBERT and MARBERT, that have superior performance to all existing models.
ArBench is built using 41 datasets targeting 5 different tasks/task clusters.
When fine-tuned on ArBench, ARBERT and MARBERT collectively achieve new SOTA with sizeable margins compared to all existing models.
- Score: 6.021269454707625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked language models (MLM) have become an integral part of many natural
language processing systems. Although multilingual MLMs have been introduced to
serve many languages, these have limitations as to their capacity and the size
and diversity of non-English data they are pre-trained on. In this work, we
remedy these issues for Arabic by introducing two powerful deep bidirectional
transformer-based models, ARBERT and MARBERT, that have superior performance to
all existing models. To evaluate our models, we propose ArBench, a new
benchmark for multi-dialectal Arabic language understanding. ArBench is built
using 41 datasets targeting 5 different tasks/task clusters, allowing us to
offer a series of standardized experiments under rich conditions. When
fine-tuned on ArBench, ARBERT and MARBERT collectively achieve new SOTA with
sizeable margins compared to all existing models such as mBERT, XLM-R (Base and
Large), and AraBERT on 37 out of 45 classification tasks on the 41 datasets
(82.22%). Our models are publicly available for research.
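Since the checkpoints are publicly released, the short sketch below shows how they could be loaded with the Hugging Face transformers library; the model identifier UBC-NLP/MARBERT and the three-label classification setup are illustrative assumptions, not details taken from the abstract.

```python
# Hedged sketch: loading a released checkpoint for masked language modelling
# and for ArBench-style classification fine-tuning. "UBC-NLP/MARBERT" and
# num_labels=3 are assumptions made for illustration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

MODEL_ID = "UBC-NLP/MARBERT"  # assumed Hugging Face model ID

# 1) Masked language modelling, the pre-training objective described above.
fill_mask = pipeline("fill-mask", model=MODEL_ID)
print(fill_mask(f"اللغة العربية {fill_mask.tokenizer.mask_token}"))

# 2) Attach a classification head for an ArBench-style task
#    (e.g. three-way sentiment) and fine-tune with the standard Trainer API.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)
```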
Related papers
- Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks [17.5987429821102]
Swan is a family of embedding models centred around the Arabic language.
It comes in two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model.
arXiv Detail & Related papers (2024-11-02T09:39:49Z)
- Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect [45.755756115243486]
We construct our instruction dataset by consolidating existing Darija language resources.
Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions.
arXiv Detail & Related papers (2024-09-26T14:56:38Z)
- AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We use this corpus to expand the vocabulary and retrain a BERT-based model from scratch; a minimal tokenizer-training sketch appears after this list.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z)
- GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning [0.0]
We introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content.
We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality.
Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks.
arXiv Detail & Related papers (2024-07-02T10:43:49Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models; a minimal sketch of likelihood-based multiple-choice scoring appears after this list.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models [57.76998376458017]
We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
arXiv Detail & Related papers (2023-08-30T17:07:17Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- AraMUS: Pushing the Limits of Data and Model Scale for Arabic Natural Language Processing [25.5682279613992]
We present AraMUS, the largest Arabic PLM with 11B parameters trained on 529GB of high-quality Arabic textual data.
AraMUS achieves state-of-the-art performances on a diverse set of Arabic classification and generative tasks.
arXiv Detail & Related papers (2023-06-11T22:55:18Z)
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [264.96498474333697]
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions.
We present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers.
BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages.
arXiv Detail & Related papers (2022-11-09T18:48:09Z)
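A minimal sketch of the vocabulary-building step described in the AlcLaM entry above, using the Hugging Face tokenizers library; the corpus file name, vocabulary size, and output path are illustrative assumptions, not details from that paper.

```python
# Hedged sketch: training a new WordPiece vocabulary on a dialectal corpus
# before pre-training a BERT-style model from scratch.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

trainer = trainers.WordPieceTrainer(
    vocab_size=30_000,  # assumed size; dialect-heavy corpora may need more
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# "dialect_corpus.txt" stands in for a social-media dialect corpus.
tokenizer.train(files=["dialect_corpus.txt"], trainer=trainer)
tokenizer.save("dialect_wordpiece.json")
```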
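The ArabicMMLU entry above evaluates models on multiple-choice questions. A common way to do this with a causal LM is to score each option by its average token log-probability given the question; the sketch below illustrates that idea with a placeholder checkpoint and is not the benchmark's official evaluation code (it also ignores tokenizer edge cases where encoding question plus option is not a strict extension of encoding the question alone).

```python
# Hedged sketch: likelihood-based multiple-choice scoring with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gpt2"  # placeholder checkpoint; substitute any causal LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()


def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option tokens given the question."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, T, vocab]
    # Position t predicts token t+1, so drop the last position.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    start = q_ids.shape[1]               # index of the first option token
    option_ids = full_ids[0, start:]
    token_lp = log_probs[0, start - 1:, :].gather(1, option_ids.unsqueeze(-1))
    return token_lp.mean().item()


def answer(question: str, options: list[str]) -> str:
    """Pick the option whose continuation the model finds most likely."""
    return max(options, key=lambda o: option_logprob(question, o))
```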