Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open
Generative Large Language Models
- URL: http://arxiv.org/abs/2308.16149v2
- Date: Fri, 29 Sep 2023 11:51:51 GMT
- Title: Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open
Generative Large Language Models
- Authors: Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan
Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen,
Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan,
Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham
Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness,
Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren,
Preslav Nakov, Timothy Baldwin, Eric Xing
- Abstract summary: We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts.
We provide a detailed description of the training, the tuning, the safety alignment, and the evaluation of the models.
- Score: 57.76998376458017
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Jais and Jais-chat, new state-of-the-art Arabic-centric
foundation and instruction-tuned open generative large language models (LLMs).
The models are based on the GPT-3 decoder-only architecture and are pretrained
on a mixture of Arabic and English texts, including source code in various
programming languages. With 13 billion parameters, they demonstrate better
knowledge and reasoning capabilities in Arabic than any existing open Arabic
and multilingual models by a sizable margin, based on extensive evaluation.
Moreover, the models are competitive in English compared to English-centric
open models of similar size, despite being trained on much less English data.
We provide a detailed description of the training, the tuning, the safety
alignment, and the evaluation of the models. We release two open versions of
the model -- the foundation Jais model, and an instruction-tuned Jais-chat
variant -- with the aim of promoting research on Arabic LLMs. Available at
https://huggingface.co/inception-mbzuai/jais-13b-chat
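
For readers who want to try the released checkpoint, the minimal sketch below shows one way to load Jais-13b-chat with the Hugging Face transformers library and generate a completion. The model id comes from the link above; the dtype, device placement, and prompt are illustrative assumptions, not settings prescribed by the paper.

```python
# Minimal sketch of loading the released Jais-13b-chat checkpoint with
# Hugging Face transformers. The model id is taken from the abstract above;
# dtype, device placement, and the prompt are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inception-mbzuai/jais-13b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The Jais checkpoints ship custom modeling code on the Hub, so remote code
# typically has to be trusted when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed; choose a dtype your hardware supports
    device_map="auto",            # requires the `accelerate` package
    trust_remote_code=True,
)

prompt = "ما هي عاصمة دولة الإمارات؟"  # "What is the capital of the UAE?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```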
Related papers
- ALLaM: Large Language Models for Arabic and English [9.881560166505452]
We present ALLaM: Arabic Large Language Model, a series of large language models to support the ecosystem of Arabic Language Technologies (ALT).
Our autoregressive decoder-only architecture models demonstrate how second-language acquisition via vocabulary expansion and pretraining can steer a model towards a new language (Arabic) without any catastrophic forgetting in the original language (English).
We show that extensive alignment with human preferences can significantly enhance the performance of a language model compared to models of a larger scale with lower quality alignment.
arXiv Detail & Related papers (2024-07-22T05:35:17Z)
- AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
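
As a rough illustration of the vocabulary step mentioned above, the sketch below trains a WordPiece vocabulary on a raw dialectal sentence file with the `tokenizers` library. The corpus file name and vocabulary size are hypothetical; this is not the authors' pipeline.

```python
# Hedged sketch of building a dialect-specific WordPiece vocabulary from a raw
# sentence corpus, the kind of step implied by "expand the vocabulary" above.
# The corpus file name and vocabulary size are hypothetical placeholders.
import os
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["dialect_sentences.txt"],  # hypothetical file, one sentence per line
    vocab_size=64_000,                # assumed vocabulary size
    min_frequency=2,
)

os.makedirs("alclam-tokenizer", exist_ok=True)
tokenizer.save_model("alclam-tokenizer")  # writes vocab.txt for BERT-style pre-training
```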
arXiv Detail & Related papers (2024-07-18T02:13:50Z)
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
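
A rough sketch of the first stage described above, under the assumption that the base checkpoint is a standard Hugging Face causal LM: grow the vocabulary, then train only the input and output embedding matrices while the rest of the network stays frozen. The base model id and the new tokens are placeholders, not the paper's configuration.

```python
# Hedged sketch of stage one described above: expand the vocabulary and train
# only the embedding matrices. The base model id and the new tokens are
# placeholders, not the paper's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "some-monolingual-llm"           # hypothetical monolingual checkpoint
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Add new-language tokens and grow the embedding (and tied output) matrices.
tokenizer.add_tokens(["مثال", "لغة"])      # illustrative Arabic tokens only
model.resize_token_embeddings(len(tokenizer))

# Freeze everything, then unfreeze only the embeddings for stage-one training.
for p in model.parameters():
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True
if model.get_output_embeddings() is not None:
    for p in model.get_output_embeddings().parameters():
        p.requires_grad = True

# Stage two (per the summary above) would then continue pre-training the full
# model on a mix of Arabic and English corpora.
```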
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
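
The sketch below shows one common way such multiple-choice evaluations are run with a causal LM: score each answer option's log-likelihood given the question and pick the highest. It is a hedged illustration, not the benchmark's official harness; the model id, question, and options are placeholders.

```python
# Hedged sketch of likelihood-based multiple-choice scoring with a causal LM,
# one common way benchmarks like this are evaluated. Not the official harness;
# the model id, question, and options below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inception-mbzuai/jais-13b"     # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the question.
    (Assumes the question tokenization is a prefix of the joint tokenization.)"""
    ctx_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts tokens 1..T-1
    targets = full[0, 1:]
    start = ctx_len - 1                     # first target position of the option
    idx = torch.arange(start, targets.shape[0])
    return log_probs[idx, targets[start:]].sum().item()

question = "سؤال اختيار من متعدد ..."       # placeholder MSA question text
options = ["الخيار أ", "الخيار ب", "الخيار ج", "الخيار د"]
scores = [option_logprob(question, o) for o in options]
print("predicted:", options[scores.index(max(scores))])
```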
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding [44.048072667378115]
Existing Arabic PLMs are not well explored, and their pre-training can be improved significantly.
There is a lack of systematic and reproducible evaluation of these models in the literature.
We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks.
arXiv Detail & Related papers (2022-05-21T22:38:19Z)
- PAGnol: An Extra-Large French Generative Model [53.40189314359048]
We introduce PAGnol, a collection of French GPT models.
Using scaling laws, we efficiently train PAGnol-XL with the same computational budget as CamemBERT.
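
As a back-of-the-envelope illustration of what training under a fixed computational budget means, the sketch below uses the standard C ≈ 6·N·D approximation for dense-transformer training compute; the parameter and token counts are placeholders, not PAGnol's actual configuration.

```python
# Back-of-the-envelope sketch of the standard C ≈ 6 * N * D approximation for
# dense-transformer training compute (N parameters, D training tokens), the kind
# of quantity scaling-law analyses trade off. Numbers are placeholders only.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# e.g. a hypothetical 1.5B-parameter model trained on 30B tokens:
print(f"~{train_flops(1.5e9, 30e9):.2e} FLOPs")
```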
arXiv Detail & Related papers (2021-10-16T11:44:23Z)
- AraGPT2: Pre-Trained Transformer for Arabic Language Generation [0.0]
We develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles.
Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available.
For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles.
arXiv Detail & Related papers (2020-12-31T09:48:05Z)
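
For context on the perplexity figure above, the sketch below shows the usual way held-out perplexity is computed for a causal LM: exponentiate the average per-token negative log-likelihood. The model id and evaluation texts are illustrative assumptions, not the paper's evaluation setup.

```python
# Hedged sketch of computing held-out perplexity for a causal LM (exp of the
# mean per-token negative log-likelihood), the quantity reported above. The
# model id and evaluation texts are illustrative assumptions.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aubmindlab/aragpt2-base"       # assumed Hub id, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

heldout_texts = ["نص عربي للتقييم ...", "مقال آخر ..."]   # placeholder held-out articles

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in heldout_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss   # mean cross-entropy over predicted tokens
        n_pred = ids.shape[1] - 1
        total_nll += loss.item() * n_pred
        total_tokens += n_pred

print("perplexity:", math.exp(total_nll / total_tokens))
```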
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.