Zero-Resource Multi-Dialectal Arabic Natural Language Understanding
- URL: http://arxiv.org/abs/2104.06591v1
- Date: Wed, 14 Apr 2021 02:29:27 GMT
- Title: Zero-Resource Multi-Dialectal Arabic Natural Language Understanding
- Authors: Muhammad Khalifa and Hesham Hassan and Aly Fahmy
- Abstract summary: We investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a pre-trained language model on Modern Standard Arabic (MSA) data only.
We propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD).
Our results demonstrate the effectiveness of self-training with unlabeled DA data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A reasonable amount of annotated data is required for fine-tuning pre-trained
language models (PLM) on downstream tasks. However, obtaining labeled examples
for different language varieties can be costly. In this paper, we investigate
the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a PLM on
Modern Standard Arabic (MSA) data only, identifying a significant performance
drop when evaluating such models on DA. To remedy this drop, we
propose self-training with unlabeled DA data and apply it in the context of
named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm
detection (SRD) on several DA varieties. Our results demonstrate the
effectiveness of self-training with unlabeled DA data: improving zero-shot
MSA-to-DA transfer by as much as ~10% F1 (NER), 2% accuracy (POS tagging), and
4.5% F1 (SRD). We conduct an ablation experiment and
show that the performance boost observed directly results from the unlabeled DA
examples used for self-training. Our work opens up opportunities for leveraging
the relatively abundant labeled MSA datasets to develop DA models for zero and
low-resource dialects. We also report new state-of-the-art performance on all
three tasks and open-source our fine-tuned models for the research community.
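The self-training recipe described in the abstract can be illustrated with a minimal, hedged sketch: train on labeled MSA data, pseudo-label unlabeled DA text, keep only high-confidence predictions, and retrain. The TF-IDF plus logistic regression classifier below is a stand-in for the paper's fine-tuned pre-trained language model, and the confidence threshold and number of rounds are illustrative assumptions, not the paper's settings.
```python
# Minimal self-training sketch for MSA-to-DA transfer on a sentence-level task
# (e.g. sarcasm detection). A stand-in classifier is trained on labeled MSA
# data, then repeatedly augmented with its own high-confidence predictions
# ("pseudo-labels") on unlabeled DA text.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(msa_texts, msa_labels, da_texts, rounds=3, threshold=0.9):
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X_lab = vec.fit_transform(msa_texts)            # labeled MSA features
    y_lab = np.asarray(msa_labels)
    X_unlab = vec.transform(da_texts)               # unlabeled DA features

    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(rounds):
        if X_unlab.shape[0] == 0:
            break
        probs = clf.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold  # keep confident pseudo-labels only
        if not confident.any():
            break
        pseudo = clf.classes_[probs.argmax(axis=1)[confident]]
        X_lab = vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]               # don't pseudo-label the same text twice
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    return vec, clf

# Usage: vec, clf = self_train(msa_texts, msa_labels, da_texts)
```
For token-level tasks such as NER and POS tagging, the same loop would apply with per-token (or per-sentence) confidence in place of the sentence-level probability used here.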
Related papers
- RDBE: Reasoning Distillation-Based Evaluation Enhances Automatic Essay Scoring [0.0]
Reasoning Distillation-Based Evaluation (RDBE) integrates interpretability to elucidate the rationale behind model scores.
Our experimental results demonstrate the efficacy of RDBE across all scoring rubrics considered in the dataset.
arXiv Detail & Related papers (2024-07-03T05:49:01Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
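As a hedged illustration of the UAL entry above, the sketch below applies a per-sample label-smoothing value that grows with the sample's uncertainty; the linear mapping and the max_eps cap are assumptions for illustration, not the paper's exact formulation.
```python
# Hedged sketch: cross-entropy with a per-sample label-smoothing value driven
# by an uncertainty score in [0, 1] (e.g. normalized predictive entropy).
import torch
import torch.nn.functional as F

def adaptive_smoothing_loss(logits: torch.Tensor,
                            targets: torch.Tensor,
                            uncertainty: torch.Tensor,
                            max_eps: float = 0.2) -> torch.Tensor:
    eps = max_eps * uncertainty                        # (batch,) smoothing per sample
    log_probs = F.log_softmax(logits, dim=-1)          # (batch, classes)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    uniform = -log_probs.mean(dim=-1)                  # loss against a uniform target
    return ((1.0 - eps) * nll + eps * uniform).mean()

# Example: more uncertain samples receive softer targets.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 0])
uncertainty = torch.tensor([0.1, 0.9, 0.5, 0.0])
print(adaptive_smoothing_loss(logits, targets, uncertainty))
```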
- GeMQuAD: Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning [4.8838210812204235]
In this paper, we propose GeMQuAD, a semi-supervised learning approach applied to a dataset generated through in-context learning (ICL) with just one example in the target language.
We iteratively identify high-quality data to enhance model performance, especially in low-resource multilingual settings.
Our framework outperforms the machine translation-augmented model by 0.22/1.68 F1/EM points for Hindi and 0.82/1.37 F1/EM points for Spanish on the MLQA dataset.
arXiv Detail & Related papers (2024-04-14T06:55:42Z)
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model, we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
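A hedged sketch of the reweighting idea summarized in the entry above: the discriminator's estimate that an example is clean becomes its weight in the NER loss. The clamping floor and the normalization scheme are illustrative assumptions.
```python
# Hedged sketch: down-weight training examples that a separately trained
# discriminator judges likely to be noisy. per_example_loss holds unreduced
# NER losses; clean_prob is the discriminator's P(clean) per example.
import torch

def reweighted_loss(per_example_loss: torch.Tensor,
                    clean_prob: torch.Tensor,
                    floor: float = 0.1) -> torch.Tensor:
    weights = clean_prob.clamp(min=floor)   # never silence an example entirely
    weights = weights / weights.sum()       # keep the overall loss scale stable
    return (weights * per_example_loss).sum()

# Example: a likely-noisy example (low P(clean)) contributes less to the loss.
losses = torch.tensor([0.8, 1.2, 0.3])
clean_prob = torch.tensor([0.95, 0.05, 0.90])
print(reweighted_loss(losses, clean_prob))
```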
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- Fortunately, Discourse Markers Can Enhance Language Models for Sentiment Analysis [13.149482582098429]
We propose to leverage sentiment-carrying discourse markers to generate large-scale weakly-labeled data.
We show the value of our approach on various benchmark datasets, including the finance domain.
arXiv Detail & Related papers (2022-01-06T12:33:47Z)
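A hedged sketch of the weak-labeling idea from the entry above: sentences that open with a sentiment-carrying discourse marker receive a weak label and the marker is stripped. The marker lists here are illustrative, not the paper's curated set.
```python
# Hedged sketch: turn raw sentences into weakly labeled sentiment data using
# sentiment-carrying discourse markers at the start of the sentence.
import re

POSITIVE_MARKERS = ("fortunately", "luckily", "thankfully")
NEGATIVE_MARKERS = ("unfortunately", "sadly", "regrettably")

def weak_label(sentences):
    """Yield (text_without_marker, weak_label) for sentences starting with a marker."""
    for sent in sentences:
        lowered = sent.strip().lower()
        for markers, label in ((POSITIVE_MARKERS, "positive"),
                               (NEGATIVE_MARKERS, "negative")):
            marker = next((m for m in markers if lowered.startswith(m)), None)
            if marker is not None:
                text = re.sub(rf"^\s*{marker}\s*,?\s*", "", sent, flags=re.IGNORECASE)
                yield text, label
                break

print(list(weak_label(["Fortunately, the refund arrived on time.",
                       "Sadly, the app crashes on every login."])))
```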
- On the Use of External Data for Spoken Named Entity Recognition [40.93448412171246]
Recent advances in self-supervised speech representations have made it feasible to consider learning models with limited labeled data.
We draw on a variety of approaches, including self-training, knowledge distillation, and transfer learning, and consider their applicability to both end-to-end models and pipeline approaches.
arXiv Detail & Related papers (2021-12-14T18:49:26Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling [7.310390479801139]
We self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties.
Our work opens up opportunities for developing DA models exploiting only MSA resources.
arXiv Detail & Related papers (2021-01-12T21:29:30Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
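A hedged sketch of the linearization behind the DAGA entry above: tags are interleaved with the tokens they label so a language model can be trained on, and later sample, flat sequences that decode back into labeled examples. The exact markup is an assumption for illustration.
```python
# Hedged sketch: linearize a tagged sentence by inserting each non-O tag right
# before the token it labels; generated sequences are de-linearized back into
# (token, tag) pairs.
def linearize(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)        # the tag token precedes the word it labels
        out.append(tok)
    return " ".join(out)

def delinearize(sequence, tagset):
    pairs, pending = [], "O"
    for piece in sequence.split():
        if piece in tagset:
            pending = piece        # remember the tag for the next word
        else:
            pairs.append((piece, pending))
            pending = "O"
    return pairs

seq = linearize(["John", "lives", "in", "Cairo"], ["B-PER", "O", "O", "B-LOC"])
print(seq)                                  # B-PER John lives in B-LOC Cairo
print(delinearize(seq, {"B-PER", "B-LOC"}))
```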
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)