Sentiment Analysis Dataset in Moroccan Dialect: Bridging the Gap Between
Arabic and Latin Scripted dialect
- URL: http://arxiv.org/abs/2303.15987v2
- Date: Mon, 6 Nov 2023 18:38:55 GMT
- Authors: Mouad Jbel, Imad Hafidi, Abdulmutallib Metrane
- Abstract summary: This study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity.
By assembling a diverse range of textual data, we constructed a dataset of 20,000 manually labeled texts in Moroccan dialect.
To explore sentiment analysis, we conducted a comparative study of multiple machine learning models to assess their compatibility with our dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiment analysis, the automated process of determining emotions or opinions
expressed in text, has seen extensive exploration in the field of natural
language processing. However, one aspect that has remained underrepresented is
the sentiment analysis of the Moroccan dialect, which boasts a unique
linguistic landscape and the coexistence of multiple scripts. Previous works in
sentiment analysis primarily targeted dialects employing Arabic script. While
these efforts provided valuable insights, they may not fully capture the
complexity of Moroccan web content, which features a blend of Arabic and Latin
script. As a result, our study emphasizes the importance of extending sentiment
analysis to encompass the entire spectrum of Moroccan linguistic diversity.
Central to our research is the creation of the largest public dataset for
Moroccan dialect sentiment analysis that incorporates Moroccan dialect written
not only in Arabic script but also in Latin letters. By assembling a diverse
range of textual data, we constructed a dataset of 20,000 manually labeled
texts in Moroccan dialect, along with publicly available lists of Moroccan
dialect stop words. To explore sentiment analysis, we conducted a comparative
study of multiple machine learning models to assess their compatibility with
our dataset. Experiments were performed on both raw and preprocessed data to
show the importance of the preprocessing step. Our model achieved 92% accuracy,
and to further demonstrate its reliability we tested it on smaller publicly
available datasets of Moroccan dialect, where the results were favorable.
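The abstract mentions preprocessing and stop-word lists for text written in both Arabic and Latin script. The sketch below is one possible way such a step could look, not the authors' pipeline: the function names `detect_script` and `preprocess`, the tokenization rules, and the tiny `STOP_WORDS` lists are all hypothetical and chosen only for illustration. It uses the Python standard library alone.

```python
import re

# Hypothetical stop-word lists for illustration only; the paper's
# published lists are NOT reproduced here.
STOP_WORDS = {
    "arabic": {"و", "في", "على"},
    "latin": {"w", "f", "3la"},
}

# Basic Arabic letters (U+0621..U+064A) and ASCII Latin letters.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
LATIN_LETTER = re.compile(r"[A-Za-z]")

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640) are removed.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# A token is a run of Arabic letters, or a run of Latin letters and digits.
# Digits are kept with Latin text because Latin-script dialect writing
# often uses them as letters (e.g. "3la").
TOKEN = re.compile(r"[\u0621-\u064A]+|[A-Za-z0-9]+")


def detect_script(text):
    """Return 'arabic', 'latin', or 'unknown' by majority letter count."""
    arabic = len(ARABIC_LETTER.findall(text))
    latin = len(LATIN_LETTER.findall(text))
    if arabic == 0 and latin == 0:
        return "unknown"
    return "arabic" if arabic >= latin else "latin"


def preprocess(text):
    """Lowercase, strip diacritics, tokenize, and drop stop words."""
    text = DIACRITICS.sub("", text.lower())
    tokens = TOKEN.findall(text)
    stop = STOP_WORDS["arabic"] | STOP_WORDS["latin"]
    return [t for t in tokens if t not in stop]


if __name__ == "__main__":
    for sample in ["Had l film zwin w mzyan!", "الفيلم زوين و مزيان"]:
        print(detect_script(sample), preprocess(sample))
```

Both stop-word lists are applied regardless of the detected script, so a comment that mixes scripts is still filtered; a real system would likely tune that choice against held-out data.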
Related papers
- Strategies for Arabic Readability Modeling [9.976720880041688]
Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility.
We present a set of experimental results on Arabic readability assessment using a diverse range of approaches.
arXiv Detail & Related papers (2024-07-03T11:54:11Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially the Swahili Language.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages.
These languages originate from five distinct language families and are predominantly spoken in Africa and Asia.
Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain [6.10917825357379]
We present a benchmark data set for evaluating the methods of separating Arabic words.
This dataset includes about 223,690 words from the book of Sharia alIslam, which have been labeled by experts.
arXiv Detail & Related papers (2023-06-22T16:50:40Z)
- Sentiment Analysis in Poems in Misurata Sub-dialect -- A Sentiment Detection in an Arabic Sub-dialect [0.0]
This study focuses on detecting sentiment in poems written in the Misurata Arabic sub-dialect spoken in Libya.
The tools used to detect sentiment in the dataset are Sklearn and the Mazajak sentiment tool.
arXiv Detail & Related papers (2021-09-15T10:42:39Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Automatic Arabic Dialect Identification Systems for Written Texts: A Survey [0.0]
Arabic dialect identification is a specific task of natural language processing, aiming to automatically predict the Arabic dialect of a given text.
In this paper, we present a comprehensive survey of Arabic dialect identification research in written texts.
We review the traditional machine learning methods, deep learning architectures, and complex learning approaches to Arabic dialect identification.
arXiv Detail & Related papers (2020-09-26T15:33:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.