Related papers: Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

URL: http://arxiv.org/abs/2310.13661v1
Date: Fri, 20 Oct 2023 17:04:22 GMT
Title: Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification
Authors: Amr Keleg and Walid Magdy
Abstract summary: We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons ADI systems are reported to fail at distinguishing between the micro-dialects of Arabic. A manual error analysis of the predictions of an ADI system, performed by 7 native speakers of different Arabic dialects, revealed that $\approx$ 66% of the validated errors are not true errors. We propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.
Score: 12.201535821920624
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis of the predictions of an ADI system, performed by 7 native speakers of different Arabic dialects, revealed that $\approx$ 66% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.
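
The evaluation issue described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy samples, the country labels, and the function names are hypothetical and are not taken from the paper or its data. It only shows how a prediction that is valid for a sentence can still be counted as an error when each sentence carries a single gold label.

```python
# Toy illustration with hypothetical data (not from the paper).
# Each sample: (single gold label, set of all valid dialects, model prediction).
samples = [
    ("Egypt", {"Egypt", "Sudan"}, "Sudan"),                # valid, but not the gold label
    ("Syria", {"Syria", "Lebanon", "Jordan"}, "Lebanon"),  # valid, but not the gold label
    ("Morocco", {"Morocco"}, "Morocco"),                   # correct under both framings
    ("Iraq", {"Iraq"}, "Egypt"),                           # true error
    ("Tunisia", {"Tunisia"}, "Algeria"),                   # true error
]


def single_label_accuracy(data):
    """Fraction of predictions that equal the single gold label."""
    return sum(pred == gold for gold, _, pred in data) / len(data)


def multi_label_accuracy(data):
    """Fraction of predictions that fall inside the set of valid dialects."""
    return sum(pred in valid for _, valid, pred in data) / len(data)


def share_of_errors_not_true(data):
    """Among single-label errors, the fraction whose prediction is actually valid."""
    errors = [(valid, pred) for gold, valid, pred in data if pred != gold]
    if not errors:
        return 0.0
    return sum(pred in valid for valid, pred in errors) / len(errors)


print(single_label_accuracy(samples))
print(multi_label_accuracy(samples))
print(share_of_errors_not_true(samples))
```

On this toy data the single-label score is lower than the multi-label one because two of the four single-label errors are valid predictions. The numbers say nothing about any real ADI system.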

Related papers

Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models [41.723923327955355]
We show that Arabic Dialect Identification (ADI) should be framed as a multi-label classification task. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples. We construct a multi-label dataset by generating automatic multi-label annotations and aggregation guided by the Arabic Level of Dialectness (ALDi). Our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system.
arXiv Detail & Related papers (2026-02-12T09:30:55Z)
ADI-20: Arabic Dialect Identification dataset and models [11.457009449330068]
We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. We used this dataset to train and evaluate various state-of-the-art ADI systems.
arXiv Detail & Related papers (2025-11-13T08:17:00Z)
Active Generalized Category Discovery [60.69060965936214]
Generalized Category Discovery (GCD) endeavors to cluster unlabeled samples from both novel and old classes. We take the spirit of active learning and propose a new setting called Active Generalized Category Discovery (AGCD). Our method achieves state-of-the-art performance on both generic and fine-grained datasets.
arXiv Detail & Related papers (2024-03-07T07:12:24Z)
VariErr NLI: Separating Annotation Error from Human Label Variation [23.392480595432676]
We introduce a systematic methodology and a new dataset, VariErr (variation versus error). VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We find that state-of-the-art AED methods significantly underperform GPTs and humans.
arXiv Detail & Related papers (2024-03-04T10:57:14Z)
ROG$_{PL}$: Robust Open-Set Graph Learning via Region-Based Prototype Learning [52.60434474638983]
We propose a unified framework named ROG$_{PL}$ to achieve robust open-set learning on complex noisy graph data. The framework consists of two modules, i.e., denoising via label propagation and open-set prototype learning via regions. To the best of our knowledge, the proposed ROG$_{PL}$ is the first robust open-set node classification method for graph data with complex noise.
arXiv Detail & Related papers (2024-02-28T17:25:06Z)
ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi). We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z)
A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model [9.999900422312098]
We develop a token-level label mapping to condition the general-purpose speech model (GSM) for Arabic Dialect Identification (ADI). We achieve new state-of-the-art accuracy on the ADI-17 dataset by vanilla fine-tuning. Our study demonstrates how to identify Arabic dialects using a small dataset and limited computation, with open-source code and pre-trained models.
arXiv Detail & Related papers (2023-05-18T18:15:53Z)
Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations. We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
On Non-Random Missing Labels in Semi-Supervised Learning [114.62655062520425]
Semi-Supervised Learning (SSL) is fundamentally a missing label problem. We explicitly incorporate "class" into SSL. Our method not only significantly outperforms existing baselines but also surpasses other label bias removal SSL methods.
arXiv Detail & Related papers (2022-06-29T22:01:29Z)
Automatic Error Type Annotation for Arabic [20.51341894424478]
We present ARETA, an automatic error type annotation system for Modern Standard Arabic. We base our error taxonomy on the Arabic Learner Corpus (ALC) Error Tagset with some modifications. ARETA achieves a performance of 85.8% (micro average F1 score) on a manually annotated blind test portion of ALC.
arXiv Detail & Related papers (2021-09-16T15:50:11Z)
Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification [52.69730591919885]
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations. We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
Multi-Dialect Arabic BERT for Country-Level Dialect Identification [1.2928709656541642]
We present the experiments conducted and the models developed by our competing team, Mawdoo3 AI. The dialect identification subtask provides 21,000 country-level labeled tweets covering all 21 Arab countries. We publicly release the pre-trained language model component of our winning solution under the name of Multi-dialect-Arabic-BERT model.
arXiv Detail & Related papers (2020-07-10T21:11:46Z)
Unsupervised Person Re-identification via Multi-label Classification [55.65870468861157]
This paper formulates unsupervised person ReID as a multi-label classification task to progressively seek true labels. Our method starts by assigning each person image a single-class label, then evolves to multi-label classification by leveraging the updated ReID model for label prediction. To boost the ReID model training efficiency in multi-label classification, we propose the memory-based multi-label classification loss (MMCL).
arXiv Detail & Related papers (2020-04-20T12:13:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.