Parameter and Data Efficient Continual Pre-training for Robustness to
Dialectal Variance in Arabic
- URL: http://arxiv.org/abs/2211.03966v1
- Date: Tue, 8 Nov 2022 02:51:57 GMT
- Title: Parameter and Data Efficient Continual Pre-training for Robustness to
Dialectal Variance in Arabic
- Authors: Soumajyoti Sarkar, Kaixiang Lin, Sailik Sengupta, Leonard Lausen,
Sheng Zha, Saab Mansour
- Abstract summary: We show that multilingual BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time and yields comparable accuracy when compared to our custom monolingual Arabic model.
We then explore two continual pre-training methods: (1) continual finetuning on small amounts of dialectal data and (2) pre-training on parallel Arabic-to-English data with a Translation Language Modeling loss function.
- Score: 9.004920233490642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of multilingual language models for tasks in low- and high-resource
languages has been a success story in deep learning. In recent times, Arabic
has been receiving widespread attention on account of its dialectal variance.
While prior research studies have tried to adapt these multilingual models to
dialectal variants of Arabic, it remains a challenging problem owing to the
lack of sufficient monolingual dialectal data and parallel translation data for
such dialectal variants. Whether this limited dialectal data can be used to
improve models trained on Arabic for its dialectal variants remains an open
problem. First, we show that multilingual BERT (mBERT) incrementally pretrained
on Arabic monolingual data takes less training time and yields accuracy
comparable to our custom monolingual Arabic model, while beating existing
models (by an average metric gain of +$6.41$). We then explore two continual
pre-training methods: (1) continual finetuning on small amounts of dialectal
data and (2) pre-training on parallel Arabic-to-English data with a Translation
Language Modeling loss function. We show that both approaches help improve
performance on dialectal classification tasks ($+4.64$ avg. gain) when used on
monolingual models.
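To make the two methods concrete, the sketch below illustrates (1) continual masked language modeling on a small dialectal corpus and (2) a Translation Language Modeling (TLM) style objective that packs each Arabic sentence with its English translation. It is a minimal sketch under assumptions, not the authors' code: it presumes the Hugging Face transformers library and the public mBERT checkpoint, and the masking probability, sequence lengths, learning rate, and example sentences are illustrative.

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)

# Start from the public mBERT checkpoint and continue masked-language-model
# pre-training; checkpoint name and hyperparameters are illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)


def mlm_batch(dialectal_sentences):
    """(1) Continual MLM finetuning: each example is a single dialectal sentence."""
    enc = tokenizer(dialectal_sentences, truncation=True, padding=True,
                    max_length=128, return_tensors="pt")
    features = [{k: v[i] for k, v in enc.items()}
                for i in range(len(dialectal_sentences))]
    return collator(features)


def tlm_batch(arabic_sentences, english_translations):
    """(2) TLM-style batch: pack each Arabic sentence with its English translation
    as a sentence pair, so masked tokens can be recovered from either language
    (an approximation of the XLM TLM objective within BERT's pair format)."""
    enc = tokenizer(arabic_sentences, english_translations, truncation=True,
                    padding=True, max_length=256, return_tensors="pt")
    features = [{k: v[i] for k, v in enc.items()}
                for i in range(len(arabic_sentences))]
    return collator(features)


# One toy continual pre-training step; the same loop works for either batch type.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tlm_batch(["جملة قصيرة بالعربية", "مثال آخر"],
                  ["A short sentence in English", "Another example"])
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Packing the translation pair into one input lets masked Arabic tokens be predicted from the English context (and vice versa), which is the intuition behind using a TLM loss with parallel data.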
Related papers
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embedding matrix (a minimal sketch of this stage appears after this list).
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z) - Modeling Orthographic Variation in Occitan's Dialects [3.038642416291856]
Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.
arXiv Detail & Related papers (2024-04-30T07:33:51Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of
Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - Language Contamination Explains the Cross-lingual Capabilities of
English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z) - Morphosyntactic Tagging with Pre-trained Language Models for Arabic and
its Dialects [17.063334758301902]
We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained language models.
Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study.
arXiv Detail & Related papers (2021-10-13T16:43:44Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - When Being Unseen from mBERT is just the Beginning: Handling New
Languages With Multilingual Language Models [2.457872341625575]
Transfer learning based on pretraining language models on a large amount of raw data has become the new norm for reaching state-of-the-art performance in NLP.
We show that such models behave in markedly different ways on unseen languages.
arXiv Detail & Related papers (2020-10-24T10:15:03Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, with models improving as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves improvements of 2.82 (English-to-Chinese) and 1.15 (Chinese-to-English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - Can Multilingual Language Models Transfer to an Unseen Dialect? A Case
Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect.
We take user-generated North African Arabizi as our case study.
We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z)