Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM
- URL: http://arxiv.org/abs/2503.14603v1
- Date: Tue, 18 Mar 2025 18:03:49 GMT
- Title: Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM
- Authors: Yazeed Alnumay, Alexandre Barbet, Anna Bialas, William Darling, Shaan Desai, Joan Devassy, Kyle Duffy, Stephanie Howe, Olivia Lasche, Justin Lee, Anirudh Shrinivason, Jennifer Tracey
- Abstract summary: Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. We present a data synthesis and refinement strategy to help address this problem, namely, by leveraging synthetic data generation and human-in-the-loop annotation. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks.
- Score: 32.99591671206201
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely, by leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post-training recipe, which is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect of enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.
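The abstract describes the data strategy only at a high level. A minimal sketch of such a synthetic-generation loop with human-in-the-loop filtering might look like the following; every name, threshold, and heuristic here is a hypothetical stand-in, not a detail taken from the paper.

```python
# Hypothetical sketch of a synthetic-data + human-in-the-loop refinement loop.
# The teacher model, scoring heuristic, and thresholds are all stand-ins.
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    response: str
    score: float = 0.0  # automatic quality score in [0, 1]

def generate_candidates(seed_prompts):
    """Stand-in for sampling draft responses from a teacher LLM."""
    return [Example(p, f"[draft Arabic answer to: {p}]") for p in seed_prompts]

def auto_score(ex: Example) -> float:
    """Stand-in for an automatic quality filter (e.g., a reward model)."""
    return min(1.0, len(ex.response) / 100)  # toy length heuristic

def refine_corpus(seed_prompts, accept_at=0.8, review_at=0.5):
    accepted, review_queue = [], []
    for ex in generate_candidates(seed_prompts):
        ex.score = auto_score(ex)
        if ex.score >= accept_at:
            accepted.append(ex)       # high confidence: keep as-is
        elif ex.score >= review_at:
            review_queue.append(ex)   # borderline: route to human annotators
        # below review_at: discard (or regenerate in the next iteration)
    return accepted, review_queue
```

Iterating this loop, with annotator corrections fed back into the accepted pool, is one plausible reading of the "iterative" refinement the abstract refers to.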
Related papers
- Sadeed: Advancing Arabic Diacritization Through Small Language Model [0.0]
We introduce Sadeed, a novel decoder-only language model for Arabic diacritization.
Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline.
We introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels.
arXiv Detail & Related papers (2025-04-30T13:37:24Z)
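Sadeed's training data consists of (undiacritized, diacritized) pairs. A simple way to manufacture such pairs from a diacritized corpus is to strip the combining diacritic marks; the prompt template below is an illustrative assumption, not the paper's actual format.

```python
# Sketch: build (prompt, target) pairs for a decoder-only diacritization
# model by stripping diacritics from an already-diacritized corpus.
import re

# Arabic diacritics (tanwin, short vowels, shadda, sukun): U+064B..U+0652
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

def make_pair(diacritized: str) -> tuple[str, str]:
    bare = strip_diacritics(diacritized)
    prompt = f"شكّل النص التالي:\n{bare}\n"  # "Diacritize the following text:"
    return prompt, diacritized  # the model learns to emit the diacritized form

prompt, target = make_pair("ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ")  # "The boy went to school."
```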
- Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines [0.8944616102795021]
This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system.
We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers.
Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions.
arXiv Detail & Related papers (2025-04-30T09:56:36Z)
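The entry's "geometrically decreasing layers" suggests hidden widths shrinking by a fixed ratio. One plausible PyTorch reading, with illustrative dimensions (a 768-d definition embedding projected down to 300-d word vectors):

```python
# Hypothetical "semi-encoder" head with geometrically decreasing widths:
# each hidden layer is `ratio` times the size of the previous one, ending
# in a projection to the word-embedding space used for retrieval.
import torch
import torch.nn as nn

class SemiEncoder(nn.Module):
    def __init__(self, in_dim=768, out_dim=300, ratio=0.5):
        super().__init__()
        layers, dim = [], in_dim
        while int(dim * ratio) > out_dim:        # shrink until near target size
            nxt = int(dim * ratio)
            layers += [nn.Linear(dim, nxt), nn.GELU()]
            dim = nxt
        layers.append(nn.Linear(dim, out_dim))   # final projection to word vectors
        self.net = nn.Sequential(*layers)

    def forward(self, definition_emb):
        return self.net(definition_emb)

model = SemiEncoder()
word_vecs = model(torch.randn(4, 768))  # a batch of 4 definition embeddings
```

At inference time, the predicted vector would be matched against pretrained word embeddings by nearest-neighbor search to produce candidate words.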
- LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian language.
We show that fine-tuning LLMic for translation after the initial pretraining phase yields a model that outperforms existing solutions on English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLMs) in the Arab world.
One practical objective for an Arabic LLM is to use an Arabic-specific tokenizer vocabulary, which can speed up decoding.
Inspired by how humans acquire vocabulary during second-language (Arabic) acquisition, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
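One stage of progressive vocabulary expansion can be sketched with standard Hugging Face APIs: add a batch of Arabic tokens, grow the embedding matrix, and initialize each new row from its old subword pieces. The checkpoint, token batch, and mean-initialization below are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of one vocabulary-expansion stage. New token embeddings are
# initialized as the mean of the embeddings of the subword pieces the
# token decomposed into under the old vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def expand_vocab(model, tokenizer, new_tokens):
    old_size = len(tokenizer)
    # record each token's subword pieces *before* the vocabulary changes
    pieces = {t: tokenizer(t, add_special_tokens=False).input_ids
              for t in new_tokens}
    tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))
    emb = model.get_input_embeddings().weight
    with torch.no_grad():
        for tok in new_tokens:
            ids = [p for p in pieces[tok] if p < old_size]
            if ids:  # initialize the new row from its old pieces
                emb[tokenizer.convert_tokens_to_ids(tok)] = emb[ids].mean(dim=0)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
expand_vocab(model, tokenizer, ["مدرسة", "كتاب"])  # one small batch of Arabic tokens
```

Later stages of the schedule would add progressively longer or rarer Arabic tokens, mirroring how human vocabulary grows during second-language acquisition.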
- GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning [0.0]
We introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content.
We assess this dataset by fine-tuning an open-source Gemma-7B model on it, improving the model's capabilities across several downstream tasks.
Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks.
arXiv Detail & Related papers (2024-07-02T10:43:49Z)
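As a generic illustration of the fine-tuning setup this entry describes (the checkpoint, dataset fields, and chat-template usage are assumptions; the paper's actual recipe may differ), instruction/response pairs can be rendered into training text like so:

```python
# Hypothetical sketch: render one InstAr-500k-style instruction/response
# pair into a chat-formatted training string for causal-LM fine-tuning.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")  # placeholder

def to_training_text(example: dict) -> str:
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": example["instruction"]},
         {"role": "assistant", "content": example["response"]}],
        tokenize=False,
    )

sample = {
    "instruction": "اشرح مفهوم الذكاء الاصطناعي بإيجاز",  # "Briefly explain AI"
    "response": "الذكاء الاصطناعي هو محاكاة الآلات لقدرات العقل البشري.",
}
print(to_training_text(sample))
# the rendered strings are then tokenized and trained with the usual
# causal-LM cross-entropy loss (often masked to the response tokens only)
```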
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training on Arabic texts and Supervised Fine-Tuning (SFT) using native Arabic instructions paired with GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models [57.12888828853409]
RAVEN is a model that combines retrieval-augmented masked language modeling and prefix language modeling.
Fusion-in-Context Learning enables the model to leverage more in-context examples without requiring additional training.
Our work underscores the potential of retrieval-augmented encoder-decoder language models for in-context learning.
arXiv Detail & Related papers (2023-08-15T17:59:18Z)
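In a fusion-in-decoder-style encoder-decoder, each retrieved passage is encoded separately and the decoder attends over all encodings at once, so different in-context examples can ride along with different passages. A toy illustration of building such per-passage inputs (the prompt format is an assumption, not RAVEN's actual template):

```python
# Hypothetical sketch of Fusion-in-Context Learning input construction:
# each retrieved passage gets its own slice of demonstrations, so the
# model sees more examples in total than any single input could hold.
def fusion_in_context_inputs(question, passages, demonstrations, per_passage=2):
    inputs = []
    for i, passage in enumerate(passages):
        # rotate which demonstrations accompany this passage
        demos = [demonstrations[(i * per_passage + j) % len(demonstrations)]
                 for j in range(per_passage)]
        shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
        inputs.append(f"{shots}\nContext: {passage}\nQ: {question}\nA:")
    return inputs  # each string is encoded separately; the decoder fuses them

batch = fusion_in_context_inputs(
    "What is the capital of Jordan?",
    passages=["Amman is the capital of Jordan.", "Jordan borders Iraq."],
    demonstrations=[("What is the capital of France?", "Paris"),
                    ("What is the capital of Egypt?", "Cairo")],
)
```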
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
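The transfer described here rests on a multilingual encoder whose representations are shared across languages: fine-tune a token-classification head on a handful of labeled English SMSs, then tag Arabic ones directly. A sketch under assumed labels and checkpoint (the paper's framework details differ):

```python
# Hypothetical cross-lingual NER setup: a multilingual encoder with a
# token-classification head, fine-tuned on ~30 English examples, is then
# applied to Arabic text with no Arabic labels.
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

LABELS = ["O", "B-MERCHANT", "I-MERCHANT", "B-AMOUNT", "I-AMOUNT"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# ... fine-tune on the ~30 labeled English SMSs (e.g., transformers.Trainer) ...

# the same head then tags Arabic transaction messages directly
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("تم خصم 25.00 ريال من بطاقتك لدى متجر النور"))  # "25.00 SAR charged at Al-Noor store"
```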
- Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding [44.048072667378115]
Existing Arabic PLMs are under-explored, and their pre-training can be improved significantly.
There is a lack of systematic and reproducible evaluation of these models in the literature.
We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks.
arXiv Detail & Related papers (2022-05-21T22:38:19Z)