Toxic language detection: a systematic review of Arabic datasets
- URL: http://arxiv.org/abs/2312.07228v2
- Date: Mon, 29 Jan 2024 21:34:27 GMT
- Title: Toxic language detection: a systematic review of Arabic datasets
- Authors: Imene Bensalem, Paolo Rosso, Hanane Zitouni
- Abstract summary: This paper offers a comprehensive survey of Arabic datasets focused on online toxic language.
We systematically gathered a total of 54 available datasets and their corresponding papers.
For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository.
- Score: 5.945303394300328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The detection of toxic language in the Arabic language has emerged as an
active area of research in recent years, and reviewing the existing datasets
employed for training the developed solutions has become a pressing need. This
paper offers a comprehensive survey of Arabic datasets focused on online toxic
language. We systematically gathered a total of 54 available datasets and their
corresponding papers and conducted a thorough analysis, considering 18 criteria
across four primary dimensions: availability details, content, annotation
process, and reusability. This analysis enabled us to identify existing gaps
and make recommendations for future research works. For the convenience of the
research community, the list of the analysed datasets is maintained in a GitHub
repository (https://github.com/Imene1/Arabic-toxic-language).
Related papers
- A Study on Scaling Up Multilingual News Framing Analysis [23.80807884935475]
This study explores the possibility of dataset creation through crowdsourcing.
We first extend framing analysis beyond English news to a multilingual context.
We also present a novel benchmark in Bengali and Portuguese on the immigration and same-sex marriage domains.
arXiv Detail & Related papers (2024-04-01T21:02:18Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z)
- When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content [0.0]
The article revisits an approach to pseudo-labelling sensitive data, using Ukrainian tweets covering the Russian-Ukrainian war as an example.
We provide a fundamental statistical analysis of the obtained data, an evaluation of the models used for pseudo-labelling, and further guidelines on how scientists can leverage the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z)
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, in which we employed different multilingual XLM-R models with a classification head, trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A Survey on non-English Question Answering Dataset [0.0]
The aim of this survey is to identify, summarize, and analyze the existing question answering datasets released by researchers.
In this paper, we review question answering datasets that are available in common languages other than English such as French, German, Japanese, Chinese, Arabic, Russian, as well as the multilingual and cross-lingual question-answering datasets.
arXiv Detail & Related papers (2021-12-27T12:45:06Z)
- Exploratory Arabic Offensive Language Dataset Analysis [0.0]
This paper provides further insights into the resources and datasets used in Arabic offensive language research.
Its main goal is to guide researchers working on Arabic offensive language in selecting appropriate datasets based on their content.
arXiv Detail & Related papers (2021-01-20T23:45:33Z)