Arabic Dialect Identification Using BERT-Based Domain Adaptation
- URL: http://arxiv.org/abs/2011.06977v1
- Date: Fri, 13 Nov 2020 15:52:51 GMT
- Title: Arabic Dialect Identification Using BERT-Based Domain Adaptation
- Authors: Ahmad Beltagy, Abdelrahman Wael, Omar ElSherief
- Abstract summary: Arabic is one of the most important and fastest-growing languages in the world.
With the rise of social media platforms such as Twitter, spoken Arabic dialects have come into wider use.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Arabic is one of the most important and fastest-growing languages in
the world. With the rise of social media platforms such as Twitter, spoken
Arabic dialects have come into wider use. In this paper, we describe our
approach to NADI Shared Task 1, which requires building a system that
differentiates between 21 Arabic dialects. We introduce a semi-supervised deep
learning approach, together with pre-processing, evaluated on the NADI Shared
Task 1 corpus. Our system ranks 4th in NADI's shared task competition,
achieving a 23.09% macro-averaged F1 score with a simple yet efficient approach
to differentiating between 21 Arabic dialects in tweets.
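Since the abstract only names the ingredients, here is a minimal sketch of the supervised fine-tuning stage such a system typically involves: fine-tuning a pre-trained Arabic BERT encoder as a 21-way tweet classifier. The checkpoint name, hyperparameters, and toy data below are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch: fine-tune a pre-trained Arabic BERT encoder as a 21-way
# dialect classifier. Checkpoint, labels, and hyperparameters are assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumed Arabic BERT checkpoint
NUM_DIALECTS = 21  # one class per Arab country in NADI Shared Task 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_DIALECTS
)

# Toy batch: tweet texts paired with country-level label indices (hypothetical).
tweets = ["tweet text one", "tweet text two"]
labels = torch.tensor([3, 17])

batch = tokenizer(tweets, padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
loss = model(**batch, labels=labels).loss  # cross-entropy over the 21 classes
loss.backward()
optimizer.step()
```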
Related papers
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
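As a rough illustration of stage one of the two-stage recipe above (expand the vocabulary, then train only the embedding matrix), here is a sketch using a small stand-in model; the checkpoint and token list are placeholders, not the paper's actual choices.

```python
# Hedged sketch of stage one: add new-language tokens, then freeze every
# parameter except the (tied) embedding matrix before continued pre-training.
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "gpt2"  # small stand-in for a monolingual foundation model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Placeholder for a real Arabic subword vocabulary learned from target data.
tokenizer.add_tokens(["arabic_token_1", "arabic_token_2"])
model.resize_token_embeddings(len(tokenizer))

# Freeze everything, then unfreeze only the embeddings (the output head of
# GPT-2 is weight-tied to them, so it is updated as well).
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # only the embedding rows remain
```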
- Exploiting Dialect Identification in Automatic Dialectal Text Normalization [9.320305816520422]
We aim to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA).
We benchmark newly developed sequence-to-sequence models on the task of CODAfication.
We show that using dialect identification information improves the performance across all dialects.
arXiv Detail & Related papers (2024-07-03T11:30:03Z)
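One common way to feed dialect identification into a sequence-to-sequence normalizer — my assumption of the general pattern, not necessarily the paper's exact mechanism — is to prepend a dialect tag to the source text:

```python
# Hedged sketch: condition a seq2seq CODAfication model on the dialect by
# prefixing the source text with a dialect tag. Tag names are assumptions.
DIALECT_TAGS = {"Egyptian": "<EGY>", "Gulf": "<GLF>", "Levantine": "<LEV>"}

def tag_source(text: str, dialect: str) -> str:
    """Prefix raw dialectal text with its (predicted) dialect tag."""
    return f"{DIALECT_TAGS[dialect]} {text}"

# The tagged string replaces the raw text on the encoder side.
print(tag_source("dialectal sentence here", "Egyptian"))
```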
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach [0.0]
We highlight our methodology for subtask 1 which deals with country-level dialect identification.
The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem.
We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.
arXiv Detail & Related papers (2023-11-30T17:37:56Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation [53.907805815477126]
This paper presents the first relatively large-scale Amharic-English parallel sentence dataset.
We build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model.
The results show that the normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
arXiv Detail & Related papers (2022-10-27T07:18:53Z)
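For intuition, normalization of homophone characters amounts to mapping characters that sound identical to one canonical form before training. The tiny mapping below covers a few well-known Amharic homophone groups and is only an illustrative subset, not the paper's full table.

```python
# Hedged sketch: collapse Amharic homophone characters to a canonical form.
HOMOPHONE_MAP = {
    "ሐ": "ሀ", "ኀ": "ሀ",  # "ha"-sounding variants
    "ሠ": "ሰ",            # "se"-sounding variant
    "ዐ": "አ",            # glottal "a" variant
}

def normalize(text: str) -> str:
    """Replace each homophone variant with its canonical character."""
    return "".join(HOMOPHONE_MAP.get(ch, ch) for ch in text)
```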
- Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both the tasks with a WER of 30.73% and 32.45% on the test sets of multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z)
- Dialect Identification in Nuanced Arabic Tweets Using Farasa Segmentation and AraBERT [0.0]
This paper presents our approach to the EACL WANLP-2021 Shared Task 1: Nuanced Arabic Dialect Identification (NADI).
The task aims at developing a system that identifies the geographical location (country/province) an Arabic tweet, written in Modern Standard Arabic or dialect, comes from.
arXiv Detail & Related papers (2021-02-19T05:39:21Z)
- WOLI at SemEval-2020 Task 12: Arabic Offensive Language Identification on Different Twitter Datasets [0.0]
A key to fighting offensive language on social media is an automatic offensive language detection system.
In this paper, we describe the system submitted by WideBot AI Lab for the shared task, which ranked 10th out of 52 participants with a macro-F1 of 86.9%.
We also introduce a neural network approach that enhances our system's predictive ability, combining CNN, highway network, Bi-LSTM, and attention layers.
arXiv Detail & Related papers (2020-09-11T14:10:03Z)
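The summary names the layer types but not the wiring; the sketch below is one plausible arrangement of CNN, highway, Bi-LSTM, and attention layers, with sizes and connections assumed rather than taken from the WideBot system.

```python
# Hedged sketch of a CNN + highway + Bi-LSTM + attention tweet classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))   # gate between transform and carry
        h = F.relu(self.transform(x))
        return t * h + (1 - t) * x

class OffensiveClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, conv_ch=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_ch, kernel_size=3, padding=1)
        self.highway = Highway(conv_ch)
        self.lstm = nn.LSTM(conv_ch, hidden, bidirectional=True,
                            batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, 2)  # offensive vs. not offensive

    def forward(self, token_ids):
        x = self.embed(token_ids)                 # (B, T, E)
        x = F.relu(self.conv(x.transpose(1, 2)))  # (B, C, T) n-gram features
        x = self.highway(x.transpose(1, 2))       # (B, T, C)
        h, _ = self.lstm(x)                       # (B, T, 2H)
        w = torch.softmax(self.attn(h), dim=1)    # attention over time steps
        pooled = (w * h).sum(dim=1)               # (B, 2H) weighted average
        return self.out(pooled)

logits = OffensiveClassifier(vocab_size=30000)(torch.randint(0, 30000, (2, 16)))
```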
- Multi-Dialect Arabic BERT for Country-Level Dialect Identification [1.2928709656541642]
We present the experiments conducted and the models developed by our competing team, Mawdoo3 AI.
The dialect identification subtask provides 21,000 country-level labeled tweets covering all 21 Arab countries.
We publicly release the pre-trained language model component of our winning solution under the name of Multi-dialect-Arabic-BERT model.
arXiv Detail & Related papers (2020-07-10T21:11:46Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves a 91.51% F1 score on English Sub-task A, comparable to the first-place result.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
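As an illustration of the multi-task setup the summary describes — a shared BERT encoder with one head per subtask — here is a minimal sketch; the checkpoint and the two binary label sets are assumptions, not the team's exact configuration.

```python
# Hedged sketch: shared BERT encoder, one classification head per subtask.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskBert(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.head_a = nn.Linear(hidden, 2)  # assumed: offensive or not
        self.head_b = nn.Linear(hidden, 2)  # assumed: targeted or not

    def forward(self, **batch):
        pooled = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] vector
        return self.head_a(pooled), self.head_b(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiTaskBert()
batch = tokenizer(["example tweet"], return_tensors="pt")
logits_a, logits_b = model(**batch)
# Training would sum the cross-entropy losses of both heads so the shared
# encoder learns from both subtasks' supervision.
```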
This list is automatically generated from the titles and abstracts of the papers on this site.