A Large-scale Dataset for Hate Speech Detection on Vietnamese Social
Media Texts
- URL: http://arxiv.org/abs/2103.11528v1
- Date: Mon, 22 Mar 2021 00:55:47 GMT
- Title: A Large-scale Dataset for Hate Speech Detection on Vietnamese Social
Media Texts
- Authors: Son T. Luu, Kiet Van Nguyen and Ngan Luu-Thuy Nguyen
- Abstract summary: ViHSD is a human-annotated dataset for automatically detecting hate speech on social networks.
The dataset contains over 30,000 comments; each comment is assigned one of three labels: CLEAN, OFFENSIVE, or HATE.
- Score: 0.32228025627337864
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, Vietnam has witnessed rapid growth in the number
of social network users on platforms such as Facebook, YouTube, Instagram,
and TikTok. On social media, hate speech has become a critical problem for
social network users. To address this problem, we introduce ViHSD, a
human-annotated dataset for automatically detecting hate speech on social
networks. The dataset contains over 30,000 comments; each comment is assigned
one of three labels: CLEAN, OFFENSIVE, or HATE. In addition, we describe the
data creation process used for annotation and for evaluating the quality of
the dataset. Finally, we evaluate the dataset with deep learning models and
transformer models.
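The abstract reports baselines from deep learning and transformer models on the three-class task. As a rough illustration only (not the authors' released code), the sketch below shows what such a transformer fine-tuning baseline could look like; the checkpoint name, sequence length, and learning rate are assumptions rather than details taken from the paper.

```python
# Minimal sketch of a transformer baseline for 3-class comment classification
# (CLEAN / OFFENSIVE / HATE). Illustrative assumptions: checkpoint name,
# max_length, and learning rate are not taken from the ViHSD paper.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = {"CLEAN": 0, "OFFENSIVE": 1, "HATE": 2}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS)
)

def encode(comments, labels):
    """Tokenize raw comments and attach integer label ids."""
    batch = tokenizer(comments, truncation=True, padding=True,
                      max_length=128, return_tensors="pt")
    batch["labels"] = torch.tensor([LABELS[lab] for lab in labels])
    return batch

# Tiny illustrative batch; real training would iterate over the ViHSD splits.
batch = encode(["bình luận ví dụ", "another example comment"],
               ["CLEAN", "OFFENSIVE"])
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
loss = model(**batch).loss  # cross-entropy over the three classes
loss.backward()
optimizer.step()
```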
Related papers
- Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts [0.0]
We first introduce ViTHSD, a targeted hate speech detection dataset for Vietnamese social media texts.
The dataset contains 10K comments; each comment is labeled with specific targets at three levels: clean, offensive, and hate.
The inter-annotator agreement on the dataset is 0.45 by Cohen's Kappa, which indicates a moderate level of agreement.
arXiv Detail & Related papers (2024-04-30T04:16:55Z)
- OPSD: an Offensive Persian Social media Dataset and its baseline evaluations [2.356562319390226]
This paper introduces two offensive-language datasets for Persian.
The first dataset comprises annotations provided by domain experts, while the second consists of a large collection of unlabeled data obtained through web crawling.
With XLM-RoBERTa, the obtained F1-scores were 76.9% on the three-class version of the dataset and 89.9% on the two-class version.
arXiv Detail & Related papers (2024-04-08T14:08:56Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) scores computed from images alone does not exclude all of the harmful content in the alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- Learning From Free-Text Human Feedback -- Collect New Datasets Or Extend Existing Ones? [57.16050211534735]
We investigate the types and frequency of free-text human feedback in commonly used dialog datasets.
Our findings provide new insights into the composition of the datasets examined, including error types, user response types, and the relations between them.
arXiv Detail & Related papers (2023-10-24T12:01:11Z)
- CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech, with absolute improvements ranging from 1.24% to 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z)
- BERT-based Ensemble Approaches for Hate Speech Detection [1.8734449181723825]
This paper focuses on classifying hate speech in social media using multiple deep models.
We evaluate several ensemble techniques, including soft voting, maximum value, hard voting, and stacking (a brief illustrative sketch of these voting strategies appears after this list).
Experiments show good results, especially for the ensemble models: stacking achieved an F1 score of 97% on the Davidson dataset, and aggregating ensembles achieved 77% on the DHO dataset.
arXiv Detail & Related papers (2022-09-14T09:08:24Z)
- BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts [1.5483942282713241]
This paper introduces a large manually labeled dataset that includes Hate Speech in different social contexts.
The dataset includes more than 50,200 offensive comments crawled from online social networking sites.
In experiments, we found that word embeddings trained exclusively on 1.47 million comments consistently resulted in better hate speech detection models.
arXiv Detail & Related papers (2022-06-01T10:10:15Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw data to dialogue systems, we carefully normalize it through processes such as anonymization.
The scale of Pchatbot is significantly larger than that of existing Chinese datasets, which may benefit data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)
- Empirical Study of Text Augmentation on Social Media Text in Vietnamese [3.0938904602244355]
In text classification, label imbalance in datasets affects the performance of text classification models.
Data augmentation techniques are applied to address the class imbalance in the dataset.
Augmentation improves the macro-F1 score by about 1.5% on both corpora.
arXiv Detail & Related papers (2020-09-25T16:18:52Z)
- Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across five classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
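As referenced in the BERT-based ensemble entry above, the following is a toy sketch (assumptions only, not that paper's code) of the voting strategies it names: soft voting averages class probabilities, the maximum-value strategy is interpreted here as taking the highest per-class score across models, and hard voting takes the majority of per-model predicted labels.

```python
# Toy illustration of ensemble voting over per-model class probabilities
# for one comment, using three ViHSD-style classes (CLEAN, OFFENSIVE, HATE).
# The "maximum value" interpretation below is an assumption, not the paper's definition.
import numpy as np

probs = np.array([
    [0.70, 0.20, 0.10],   # model 1
    [0.40, 0.45, 0.15],   # model 2
    [0.55, 0.30, 0.15],   # model 3
])

soft_vote = probs.mean(axis=0).argmax()   # average probabilities, then argmax
max_vote = probs.max(axis=0).argmax()     # highest score per class across models, then argmax
hard_vote = np.bincount(probs.argmax(axis=1), minlength=3).argmax()  # majority of per-model labels

print(soft_vote, max_vote, hard_vote)     # all 0 (CLEAN) for this toy example
```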