A Deep CNN Architecture with Novel Pooling Layer Applied to Two Sudanese
Arabic Sentiment Datasets
- URL: http://arxiv.org/abs/2201.12664v1
- Date: Sat, 29 Jan 2022 21:33:28 GMT
- Title: A Deep CNN Architecture with Novel Pooling Layer Applied to Two Sudanese
Arabic Sentiment Datasets
- Authors: Mustafa Mhamed, Richard Sutcliffe, Xia Sun, Jun Feng, Eiad Almekhlafi,
Ephrem A. Retta
- Abstract summary: Two new publicly available datasets are introduced, the 2-Class Sudanese Sentiment dataset and the 3-Class Sudanese Sentiment dataset.
A CNN architecture, SCM, is proposed, comprising five CNN layers together with a novel pooling layer, MMA, to extract the best features.
The proposed model is applied to the existing Saudi Sentiment dataset and to the MSA Hotel Arabic Review dataset, with accuracies of 85.55% and 90.01%, respectively.
- Score: 1.1034493405536276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Arabic sentiment analysis has become an important research field in recent
years. Initially, work focused on Modern Standard Arabic (MSA), which is the
most widely-used form. Since then, work has been carried out on several
different dialects, including Egyptian, Levantine and Moroccan. Moreover, a
number of datasets have been created to support such work. However, up until
now, less work has been carried out on Sudanese Arabic, a dialect which has 32
million speakers. In this paper, two new publicly available datasets are
introduced, the 2-Class Sudanese Sentiment Dataset (SudSenti2) and the 3-Class
Sudanese Sentiment Dataset (SudSenti3). Furthermore, a CNN architecture, SCM,
is proposed, comprising five CNN layers together with a novel pooling layer,
MMA, to extract the best features. This SCM+MMA model is applied to SudSenti2
and SudSenti3 with accuracies of 92.75% and 84.39%, respectively. Next, the model is compared
to other deep learning classifiers and shown to be superior on these new
datasets. Finally, the proposed model is applied to the existing Saudi
Sentiment Dataset and to the MSA Hotel Arabic Review Dataset, with accuracies of
85.55% and 90.01%, respectively.
Related papers
- EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora [0.0]
This study presents EgyBERT, an Arabic language model pretrained on 10.4 GB of Egyptian dialectal texts.
EgyBERT achieved the highest average F1-score of 84.25% and an accuracy of 87.33%.
This is the first study to evaluate the performance of various language models on Egyptian dialect datasets.
arXiv Detail & Related papers (2024-08-07T03:23:55Z)
- ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation [1.8109081066789847]
Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature.
We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics.
We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
arXiv Detail & Related papers (2024-07-29T09:45:34Z)
- Arabic Text Sentiment Analysis: Reinforcing Human-Performed Surveys with
Wider Topic Analysis [49.1574468325115]
The in-depth study manually analyses 133 ASA papers published in the English language between 2002 and 2020.
The main findings show the different approaches used for ASA: machine learning, lexicon-based and hybrid approaches.
There is a need to develop ASA tools that can be used in industry, as well as in academia, for Arabic text SA.
arXiv Detail & Related papers (2024-03-04T10:37:48Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- ArBanking77: Intent Detection Neural Model and a New Dataset in Modern
and Dialectical Arabic [0.4999814847776097]
This paper presents the ArBanking77, a large Arabic dataset for intent detection in the banking domain.
Our dataset was arabized and localized from the original English Banking77 dataset with 31,404 queries in both Modern Standard Arabic (MSA) and Palestinian dialect.
We present a neural model, based on AraBERT and fine-tuned on ArBanking77, which achieved F1-scores of 0.9209 and 0.8995 on MSA and Palestinian dialect, respectively.
arXiv Detail & Related papers (2023-10-29T14:46:11Z)
- Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation
for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D (Navya3DSeg), with a diverse label space corresponding to a large scale production grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z)
- Data Augmentation using Transformers and Similarity Measures for
Improving Arabic Text Classification [0.0]
We propose a new Arabic DA method that employs the recent powerful modeling technique, namely the AraGPT-2.
The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using the Euclidean, cosine, Jaccard, and BLEU distances.
The experiments were conducted on four Arabic sentiment datasets: AraSarcasm, ASTD, ATT, and MOVIE.
arXiv Detail & Related papers (2022-12-28T16:38:43Z)
- New Arabic Medical Dataset for Diseases Classification [55.41644538483948]
We introduce a new Arabic medical dataset, which includes two thousand medical documents collected from several Arabic medical websites.
The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological).
Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert, which is based on BERT with a large Arabic corpus, and AraBioNER, which is based on Arabert with an Arabic medical corpus.
arXiv Detail & Related papers (2021-06-29T10:42:53Z)
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end work led to WERs of 12.5%, 27.5%, and 23.8%, a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z)