Listening to Affected Communities to Define Extreme Speech: Dataset and
Experiments
- URL: http://arxiv.org/abs/2203.11764v1
- Date: Tue, 22 Mar 2022 14:24:56 GMT
- Title: Listening to Affected Communities to Define Extreme Speech: Dataset and
Experiments
- Authors: Antonis Maronikolakis, Axel Wisiorek, Leah Nann, Haris Jabbar, Sahana
Udupa, Hinrich Schuetze
- Abstract summary: We present XTREMESPEECH, a new hate speech dataset containing 20,297 social media passages from Brazil, Germany, India and Kenya.
The key novelty is that we directly involve the affected communities in collecting and annotating the data.
This inclusive approach results in datasets more representative of actually occurring online speech.
- Score: 1.1417805445492082
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Building on current work on multilingual hate speech (e.g., Ousidhoum et al.
(2019)) and hate speech reduction (e.g., Sap et al. (2020)), we present
XTREMESPEECH, a new hate speech dataset containing 20,297 social media passages
from Brazil, Germany, India and Kenya. The key novelty is that we directly
involve the affected communities in collecting and annotating the data - as
opposed to giving companies and governments control over defining and
combatting hate speech. This inclusive approach results in datasets more
representative of actually occurring online speech and is likely to facilitate
the removal of the social media content that marginalized communities view as
causing the most harm. Based on XTREMESPEECH, we establish novel tasks with
accompanying baselines, provide evidence that cross-country training is
generally not feasible due to cultural differences between countries and
perform an interpretability analysis of BERT's predictions.
Related papers
- A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities [43.37824420609252]
Hate speech online remains an understudied issue for marginalized communities.
In this paper, we aim to provide marginalized communities living in societies where the dominant language is low-resource with a privacy-preserving tool to protect themselves from hate speech on the internet.
arXiv Detail & Related papers (2024-12-06T11:00:05Z) - A Survey on Automatic Online Hate Speech Detection in Low-Resource Languages [0.5825410941577593]
Social media and easy accessibility of the internet has facilitated the spread of hate speech.
This article provides a detailed survey of hate speech detection in low-resource languages around the world.
arXiv Detail & Related papers (2024-11-28T09:42:53Z) - IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language [11.463652750122398]
We introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset.
Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups.
We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model fine-tuned for hate speech classification.
arXiv Detail & Related papers (2024-06-27T17:26:38Z) - ViTHSD: Exploiting Hatred by Targets for Hate Speech Detection on Vietnamese Social Media Texts [0.0]
We first introduce the ViTHSD - a targeted hate speech detection dataset for Vietnamese Social Media Texts.
The dataset contains 10K comments, each comment is labeled to specific targets with three levels: clean, offensive, and hate.
The inter-annotator agreement obtained from the dataset is 0.45 by Cohen's Kappa index, which is indicated as a moderate level.
arXiv Detail & Related papers (2024-04-30T04:16:55Z) - Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis [44.17106903728264]
Most hate speech datasets neglect the cultural diversity within a single language.
To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset.
Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%.
arXiv Detail & Related papers (2023-08-31T13:14:47Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a
Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - Assessing the impact of contextual information in hate speech detection [0.48369513656026514]
We provide a novel corpus for contextualized hate speech detection based on user responses to news posts from media outlets on Twitter.
This corpus was collected in the Rioplatense dialectal variety of Spanish and focuses on hate speech associated with the COVID-19 pandemic.
arXiv Detail & Related papers (2022-10-02T09:04:47Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply it to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Unsupervised Cross-lingual Representation Learning for Speech
Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z) - Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media
during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z) - Transfer Learning for Hate Speech Detection in Social Media [14.759208309842178]
This paper uses a transfer learning technique to leverage two independent datasets jointly.
We build an interpretable two-dimensional visualization tool of the constructed hate speech representation -- dubbed the Map of Hate.
We show that the joint representation boosts prediction performances when only a limited amount of supervision is available.
arXiv Detail & Related papers (2019-06-10T08:00:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.