Content-Localization based System for Analyzing Sentiment and Hate
Behaviors in Low-Resource Dialectal Arabic: English to Levantine and Gulf
- URL: http://arxiv.org/abs/2312.03727v1
- Date: Mon, 27 Nov 2023 15:37:33 GMT
- Authors: Fatimah Alzamzami, Abdulmotaleb El Saddik
- Abstract summary: This paper proposes to localize content of resources in high-resourced languages into under-resourced Arabic dialects.
We utilize content-localization based neural machine translation to develop sentiment and hate classifiers for two low-resourced Arabic dialects: Levantine and Gulf.
Our findings highlight the importance of considering the unique nature of dialects within the same language; ignoring the dialectal aspect would lead to misleading analysis.
- Score: 5.2957928879391
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Even though online social movements can quickly become viral on social media,
languages can be a barrier to timely monitoring and analyzing the underlying
online social behaviors (OSB). This is especially true for under-resourced
languages on social media like dialectal Arabic, the primary language used by
Arabs on social media. Therefore, it is crucial to provide solutions to
efficiently exploit resources from high-resourced languages to solve
language-dependent OSB analysis in under-resourced languages. This paper
proposes to localize content of resources in high-resourced languages into
under-resourced Arabic dialects. Content localization goes beyond content
translation, which converts text from one language to another; content
localization adapts culture, language nuances and regional preferences from one
language to a specific language/dialect. Automating the understanding of
natural and familiar day-to-day expressions in different regions is key to
achieving a wider analysis of OSB, especially for smart cities. In this paper, we
utilize content-localization based neural machine translation to develop
sentiment and hate classifiers for two low-resourced Arabic dialects: Levantine
and Gulf. We also leverage unsupervised learning to
facilitate the analysis of sentiment and hate predictions by inferring hidden
topics from the corresponding data and providing coherent interpretations of
those topics in their native language/dialects. The experimental evaluations
and proof-of-concept COVID-19 case study on real data have validated the
effectiveness of our proposed system in precisely distinguishing sentiments and
accurately identifying hate content in both Levantine and Gulf Arabic dialects.
Our findings highlight the importance of considering the unique nature of
dialects within the same language; ignoring the dialectal aspect would lead
to misleading analysis.
Related papers
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Content-Localization based Neural Machine Translation for Informal Dialectal Arabic: Spanish/French to Levantine/Gulf Arabic [5.2957928879391]
We propose a framework that localizes contents of high-resource languages to a low-resource language/dialects by utilizing AI power.
This is the first work to provide a parallel translation dataset between informal Spanish and French and informal Arabic dialects.
arXiv Detail & Related papers (2023-12-12T01:42:41Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Arabic Sentiment Analysis with Noisy Deep Explainable Model [48.22321420680046]
This paper proposes an explainable sentiment classification framework for the Arabic language.
The proposed framework can explain specific predictions by training a local surrogate explainable model.
We carried out experiments on public benchmark Arabic SA datasets.
arXiv Detail & Related papers (2023-09-24T19:26:53Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- A simple language-agnostic yet very strong baseline system for hate speech and offensive content identification [0.0]
A system based on a classical supervised algorithm only fed with character n-grams, and thus completely language-agnostic, is proposed.
It reached a medium performance level in English, the language for which it is easy to develop deep learning approaches.
It even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches.
arXiv Detail & Related papers (2022-02-05T08:09:09Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- DeepHateExplainer: Explainable Hate Speech Detection in Under-resourced Bengali Language [1.2246649738388389]
We propose an explainable approach for hate speech detection from the under-resourced Bengali language.
In our approach, Bengali texts are first comprehensively preprocessed, before classifying them into political, personal, geopolitical, and religious hates.
Evaluations against machine learning (linear and tree-based models) and deep neural networks (i.e., CNN, Bi-LSTM, and Conv-LSTM with word embeddings) baselines yield F1 scores of 84%, 90%, 88%, and 88%, for political, personal, geopolitical, and religious hates, respectively.
arXiv Detail & Related papers (2020-12-28T16:46:03Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)