Hostility Detection Dataset in Hindi
- URL: http://arxiv.org/abs/2011.03588v1
- Date: Fri, 6 Nov 2020 20:33:12 GMT
- Title: Hostility Detection Dataset in Hindi
- Authors: Mohit Bhardwaj, Md Shad Akhtar, Asif Ekbal, Amitava Das, Tanmoy
Chakraborty
- Abstract summary: We collect and manually annotate 8200 online posts in the Hindi language.
Hostile posts carry multi-label tags due to a significant overlap among the hostile classes.
- Score: 44.221862384125245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel hostility detection dataset in the Hindi
language. We collect and manually annotate ~8200 online posts. The annotated
dataset covers four hostility dimensions: fake news, hate speech, offensive,
and defamation posts, along with a non-hostile label. The hostile posts are
also considered for multi-label tags due to a significant overlap among the
hostile classes. We release this dataset as part of the CONSTRAINT-2021 shared
task on hostile post detection.
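As a hedged illustration of the annotation scheme the abstract describes (one or more hostility tags per hostile post, with non-hostile as a standalone label), a minimal sketch of a binary multi-label encoding might look like the following. The label names come from the abstract; the encoding itself is an assumption, not the authors' released format.

```python
# Hypothetical sketch of the multi-label scheme described in the abstract:
# each hostile post may carry one or more of the four hostility tags,
# while "non-hostile" is a standalone, mutually exclusive label.

HOSTILE_DIMENSIONS = ["fake", "hate", "offensive", "defamation"]

def encode_labels(tags):
    """Encode a post's tag set as a binary vector over the hostile classes.

    An all-zero vector corresponds to the non-hostile label.
    """
    if "non-hostile" in tags and len(tags) > 1:
        raise ValueError("non-hostile cannot co-occur with hostile tags")
    return [1 if dim in tags else 0 for dim in HOSTILE_DIMENSIONS]

# A post tagged as both hate speech and offensive:
print(encode_labels({"hate", "offensive"}))  # [0, 1, 1, 0]
# A non-hostile post maps to the all-zero vector:
print(encode_labels({"non-hostile"}))        # [0, 0, 0, 0]
```

This vector form is what standard multi-label classifiers and metrics (e.g. per-class F1) typically consume.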
Related papers
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents using Not Safe For Work (NSFW) scores computed from images alone does not exclude all harmful content in the alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have [58.23138483086277]
In this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection.
Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain.
Our experiments show that, using already existing datasets and only a few shots of the target task, model performance improves both monolingually and across languages.
arXiv Detail & Related papers (2023-05-23T14:04:12Z)
- Overview of Abusive and Threatening Language Detection in Urdu at FIRE 2021 [50.591267188664666]
We present two shared tasks of abusive and threatening language detection for the Urdu language.
We present two manually annotated datasets containing tweets labelled as (i) Abusive and Non-Abusive, and (ii) Threatening and Non-Threatening.
For both subtasks, the m-BERT-based transformer model showed the best performance.
arXiv Detail & Related papers (2022-07-14T07:38:13Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset [0.0]
"AraCOVID19-MFH" is a manually annotated multi-label Arabic COVID-19 fake news and hate speech detection dataset.
Our dataset contains 10,828 Arabic tweets annotated with 10 different labels.
It can also be used for hate speech detection, opinion/news classification, dialect identification, and many other tasks.
arXiv Detail & Related papers (2021-05-07T09:52:44Z)
- Divide and Conquer: An Ensemble Approach for Hostile Post Detection in Hindi [25.723773314371947]
The data for this task is provided in Hindi Devanagari script which was collected from Twitter and Facebook.
It is a multi-label multi-class classification problem where each data instance is annotated with one or more of the five classes: fake, hate, offensive, defamation, and non-hostile.
Our team 'Albatross' scored a coarse-grained hostility F1 score of 0.9709 on the Hostile Post Detection in Hindi subtask, securing 2nd rank out of 45 teams.
arXiv Detail & Related papers (2021-01-20T05:38:07Z)
- Hostility Detection in Hindi leveraging Pre-Trained Language Models [1.6436293069942312]
This paper presents a transfer learning based approach to classify social media posts in Hindi Devanagari script as Hostile or Non-Hostile.
Hostile posts are further analyzed to determine whether they are Hateful, Fake, Defamatory, or Offensive.
We establish a robust and consistent model without any ensembling or complex pre-processing.
arXiv Detail & Related papers (2021-01-14T08:04:32Z)
- Coarse and Fine-Grained Hostility Detection in Hindi Posts using Fine-Tuned Multilingual Embeddings [4.3012765978447565]
The hostility detection task has been well explored for resource-rich languages like English, but remains unexplored for resource-constrained languages like Hindi due to the unavailability of large suitable datasets.
We propose an effective neural network-based technique for hostility detection in Hindi posts.
arXiv Detail & Related papers (2021-01-13T11:00:31Z)
- Evaluation of Deep Learning Models for Hostility Detection in Hindi Text [2.572404739180802]
We present approaches for hostile text detection in the Hindi language.
The proposed approaches are evaluated on the Constraint@AAAI 2021 Hindi hostility detection dataset.
We evaluate a host of deep learning approaches based on CNN, LSTM, and BERT for this multi-label classification problem.
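For readers unfamiliar with how a multi-label setup like this differs from single-label classification, here is a minimal sketch (not the paper's implementation): the classifier head uses independent sigmoid activations with a per-class decision threshold, rather than a single softmax that forces exactly one class. The class names and threshold are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of a multi-label
# hostility classifier head: independent sigmoids with a per-class
# threshold, instead of one softmax over mutually exclusive classes.
import math

CLASSES = ["fake", "hate", "offensive", "defamation"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Turn raw per-class logits into a set of predicted hostility labels."""
    return {c for c, z in zip(CLASSES, logits) if sigmoid(z) >= threshold}

# A post whose encoder (e.g. CNN, LSTM, or BERT) scores two classes highly:
# both 'hate' and 'offensive' exceed the threshold here.
print(predict_labels([-2.0, 1.5, 0.8, -1.0]))
```

Any of the evaluated encoders can feed such a head; only the output layer and loss (binary cross-entropy per class) change relative to a single-label model.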
arXiv Detail & Related papers (2021-01-11T19:10:57Z)
- Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across 5 classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.