An open access NLP dataset for Arabic dialects : Data collection,
labeling, and model construction
- URL: http://arxiv.org/abs/2102.11000v1
- Date: Sun, 7 Feb 2021 01:39:52 GMT
- Title: An open access NLP dataset for Arabic dialects : Data collection,
labeling, and model construction
- Authors: ElMehdi Boujou, Hamza Chataoui, Abdellah El Mekki, Saad Benjelloun,
Ikram Chairi, and Ismail Berrada
- Abstract summary: We present an open data set of social data content in several Arabic dialects.
This data was collected from the Twitter social network and consists on +50K twits in five (5) national dialects.
We publish this data as an open access data to encourage innovation and encourage other works in the field of NLP for Arabic dialects and social media.
- Score: 0.8312466807725921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Processing (NLP) is today a very active field of research
and innovation. Many applications need however big sets of data for supervised
learning, suitably labelled for the training purpose. This includes
applications for the Arabic language and its national dialects. However, such
open access labeled data sets in Arabic and its dialects are lacking in the
Data Science ecosystem and this lack can be a burden to innovation and research
in this field. In this work, we present an open data set of social data content
in several Arabic dialects. This data was collected from the Twitter social
network and consists on +50K twits in five (5) national dialects. Furthermore,
this data was labeled for several applications, namely dialect detection, topic
detection and sentiment analysis. We publish this data as an open access data
to encourage innovation and encourage other works in the field of NLP for
Arabic dialects and social media. A selection of models were built using this
data set and are presented in this paper along with their performances.
Related papers
- ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation [1.8109081066789847]
Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature.
We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics.
We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
arXiv Detail & Related papers (2024-07-29T09:45:34Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Deepfake audio as a data augmentation technique for training automatic
speech to text transcription models [55.2480439325792]
We propose a framework that approaches data augmentation based on deepfake audio.
A dataset produced by Indians (in English) was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Improving Natural Language Inference in Arabic using Transformer Models
and Linguistically Informed Pre-Training [0.34998703934432673]
This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP)
To overcome this limitation, we create a dedicated data set from publicly available resources.
We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches.
arXiv Detail & Related papers (2023-07-27T07:40:11Z) - Izindaba-Tindzaba: Machine learning news categorisation for Long and
Short Text for isiZulu and Siswati [1.666378501554705]
Local/Native South African languages are classified as low-resource languages.
In this work, the focus was to create annotated news datasets for the isiZulu and Siswati native languages.
arXiv Detail & Related papers (2023-06-12T21:02:12Z) - An Amharic News Text classification Dataset [0.0]
We aim to introduce the Amharic text classification dataset that consists of more than 50k news articles that were categorized into 6 classes.
This dataset is made available with easy baseline performances to encourage studies and better performance experiments.
arXiv Detail & Related papers (2021-03-10T16:36:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.