Arabic Dialect Identification in the Wild
- URL: http://arxiv.org/abs/2005.06557v2
- Date: Fri, 15 May 2020 08:23:17 GMT
- Title: Arabic Dialect Identification in the Wild
- Authors: Ahmed Abdelali, Hamdy Mubarak, Younes Samih, Sabit Hassan, Kareem Darwish
- Abstract summary: We present QADI, an automatically collected dataset of tweets belonging to a wide range of country-level Arabic dialects.
The resultant dataset contains 540k tweets from 2,525 users who are evenly distributed across 18 Arab countries.
- Score: 10.010733302895938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present QADI, an automatically collected dataset of tweets belonging to a
wide range of country-level Arabic dialects, covering 18 different countries in
the Middle East and North Africa region. Our method for building this dataset
relies on applying multiple filters to identify users who belong to different
countries based on their account descriptions and to eliminate tweets that are
either written in Modern Standard Arabic or contain inappropriate language. The
resultant dataset contains 540k tweets from 2,525 users who are evenly
distributed across 18 Arab countries. Using intrinsic evaluation, we show that
the labels of a set of randomly selected tweets are 91.5% accurate. For
extrinsic evaluation, we are able to build effective country-level dialect
identification on tweets with a macro-averaged F1-score of 60.6% across 18
classes.
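The macro-averaged F1-score reported above weights every class equally, regardless of how many tweets each country contributes. A minimal sketch of that metric follows; the function name `macro_f1` and the toy country codes are illustrative assumptions, not taken from the paper or its released code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but wrong
            fn[g] += 1          # g was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (country codes), for illustration only
gold = ["EG", "EG", "SA", "MA"]
pred = ["EG", "SA", "SA", "MA"]
score = macro_f1(gold, pred)
```

Because each class counts once in the average, a classifier that does well only on the most frequent dialects would still score poorly under this metric.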
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through Dialect Identification using Transformer-based Approach [0.0]
We highlight our methodology for subtask 1 which deals with country-level dialect identification.
The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem.
We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.
arXiv Detail & Related papers (2023-11-30T17:37:56Z)
- ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- ORCA: A Challenging Benchmark for Arabic Language Understanding [8.9379057739817]
ORCA is a publicly available benchmark for Arabic language understanding evaluation.
To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models.
arXiv Detail & Related papers (2022-12-21T04:35:43Z)
- Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Arabic Offensive Language on Twitter: Analysis and Experiments [9.879488163141813]
We introduce a method for building a dataset that is not biased by topic, dialect, or target.
We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech.
arXiv Detail & Related papers (2020-04-05T13:05:11Z)