Self-Training Pre-Trained Language Models for Zero- and Few-Shot
Multi-Dialectal Arabic Sequence Labeling
- URL: http://arxiv.org/abs/2101.04758v4
- Date: Tue, 2 Feb 2021 23:36:04 GMT
- Title: Self-Training Pre-Trained Language Models for Zero- and Few-Shot
Multi-Dialectal Arabic Sequence Labeling
- Authors: Muhammad Khalifa and Muhammad Abdul-Mageed and Khaled Shaalan
- Abstract summary: We self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties. Our work opens up opportunities for developing DA models that exploit only MSA resources.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A sufficient amount of annotated data is usually required to fine-tune
pre-trained language models for downstream tasks. Unfortunately, attaining
labeled data can be costly, especially for multiple language varieties and
dialects. We propose to self-train pre-trained language models in zero- and
few-shot scenarios to improve performance on data-scarce varieties using only
resources from data-rich ones. We demonstrate the utility of our approach in
the context of Arabic sequence labeling by using a language model fine-tuned on
Modern Standard Arabic (MSA) only to predict named entities (NE) and
part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show
that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer
by as much as ~10% F1 (NER) and 2% accuracy (POS tagging).
We achieve even better performance in few-shot scenarios with limited amounts
of labeled data. We conduct an ablation study and show that the observed performance boost results directly from the unlabeled DA examples used for self-training. Our work opens up opportunities for developing DA models that exploit only MSA resources, and the approach can be extended to other languages and tasks. Our code and fine-tuned models can be accessed at
https://github.com/mohammadKhalifa/zero-shot-arabic-dialects.
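For illustration, here is a minimal sketch of the pseudo-labeling loop the abstract describes, written with Hugging Face transformers. The checkpoint name, the number of labels, the all-tokens confidence threshold, and the fine_tune helper are assumptions made for this sketch, not the authors' exact setup; see the linked repository for the real implementation.
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def pseudo_label(model, tokenizer, sentences, threshold=0.95):
    """Tag unlabeled DA sentences and keep only confident predictions."""
    model.eval()
    kept = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits        # (1, seq_len, num_labels)
        probs = logits.softmax(dim=-1)
        confidence, preds = probs.max(dim=-1)   # per-token max probability
        # Keep a sentence only if every token's prediction is confident
        # (a simple filtering rule assumed here for illustration).
        if confidence.min().item() >= threshold:
            kept.append((sent, preds.squeeze(0).tolist()))
    return kept

# One self-training round:
#   1. fine-tune on labeled MSA data (standard supervised training, omitted);
#   2. pseudo-label unlabeled DA sentences with the MSA-trained model;
#   3. add the confident pseudo-labels to the training set and fine-tune again.
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
model = AutoModelForTokenClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv02", num_labels=9)  # e.g. BIO-style NER tags
# fine_tune(model, msa_labeled)                       # hypothetical helper
# da_pseudo = pseudo_label(model, tokenizer, da_unlabeled)
# fine_tune(model, msa_labeled + da_pseudo)           # hypothetical helper
```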
Related papers
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - QAmeleon: Multilingual QA with Only 5 Examples
We show how to leverage pre-trained language models under a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z) - Language Model Pre-Training with Sparse Latent Typing
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Revisiting Self-Training for Few-Shot Learning of Language Model
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - Bilingual Alignment Pre-training for Zero-shot Cross-lingual Transfer
In this paper, we aim to improve the zero-shot cross-lingual transfer performance by aligning the embeddings better.
We propose a pre-training task named Alignment Language Model (AlignLM) which uses the statistical alignment information as the prior knowledge to guide bilingual word prediction.
The results show AlignLM can improve the zero-shot performance significantly on MLQA and XNLI datasets.
arXiv Detail & Related papers (2021-06-03T10:18:43Z) - Zero-Resource Multi-Dialectal Arabic Natural Language Understanding
We investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a pre-trained language model on modern standard Arabic (MSA) data only.
We propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD).
Our results demonstrate the effectiveness of self-training with unlabeled DA data.
arXiv Detail & Related papers (2021-04-14T02:29:27Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Unsupervised Paraphrasing with Pretrained Language Models
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Multilingual Jointly Trained Acoustic and Written Word Embeddings
We extend the idea of jointly trained acoustic and written word embeddings to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z) - ParsBERT: Transformer-based Model for Persian Language Understanding
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including both pre-existing and newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)