The Interplay of Variant, Size, and Task Type in Arabic Pre-trained
Language Models
- URL: http://arxiv.org/abs/2103.06678v1
- Date: Thu, 11 Mar 2021 14:11:43 GMT
- Title: The Interplay of Variant, Size, and Task Type in Arabic Pre-trained
Language Models
- Authors: Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, Nizar
Habash
- Abstract summary: We build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic.
We examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant.
We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets.
- Score: 14.122707779713647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the effects of language variants, data sizes, and
fine-tuning task types in Arabic pre-trained language models. To do so, we
build three pre-trained language models across three variants of Arabic: Modern
Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a
fourth language model which is pre-trained on a mix of the three. We also
examine the importance of pre-training data size by building additional models
that are pre-trained on a scaled-down set of the MSA variant. We compare our
different models to each other, as well as to eight publicly available models
by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest
that the variant proximity of pre-training data to fine-tuning data is more
important than the pre-training data size. We exploit this insight in defining
an optimized system selection model for the studied tasks.
Related papers
- Approximating Language Model Training Data from Weights [70.08614275061689]
We formalize the problem of data approximation from model weights and propose several baselines and metrics.
We develop a gradient-based approach that selects the highest-matching data from a large public text corpus.
Even when none of the true training data is known, our method is able to locate a small subset of public Web documents.
arXiv Detail & Related papers (2025-06-18T15:26:43Z) - KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model [27.25688303240741]
KaLM-Embedding is a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data.
Our model has been trained with key techniques proven to enhance performance.
arXiv Detail & Related papers (2025-01-02T03:17:51Z) - Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis [3.16714407449467]
We investigate the role of translation and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model.
To rectify these issues, we pre-train the models with a small dataset of synthesized high-quality Arabic stories.
arXiv Detail & Related papers (2024-05-23T07:53:04Z) - Smaller Language Models are capable of selecting Instruction-Tuning
Training Data for Larger Language Models [39.65879784788677]
We introduce a novel training data selection based on the learning percentage of the samples.
We assert that current language models possess the capability to autonomously select high-quality training data.
Our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.
arXiv Detail & Related papers (2024-02-16T03:39:37Z) - Multilingual E5 Text Embeddings: A Technical Report [63.503320030117145]
Three embedding models of different sizes are provided, offering a balance between the inference efficiency and embedding quality.
We introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes.
arXiv Detail & Related papers (2024-02-08T13:47:50Z) - CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z) - On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data by far is the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z) - Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., the Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
Our UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z) - Improving Large-scale Language Models and Resources for Filipino [0.0]
We outline the construction of the TLUnified dataset, a large-scale pretraining corpus for the Filipino language.
Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora.
Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy.
arXiv Detail & Related papers (2021-11-11T05:00:58Z) - Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
To our great astonishment, we uncover that pre-training on certain non-human language data gives GLUE performance close to performance pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.