Antisemitic Messages? A Guide to High-Quality Annotation and a Labeled
Dataset of Tweets
- URL: http://arxiv.org/abs/2304.14599v1
- Date: Fri, 28 Apr 2023 02:52:38 GMT
- Title: Antisemitic Messages? A Guide to High-Quality Annotation and a Labeled
Dataset of Tweets
- Authors: Gunther Jikeli, Sameer Karali, Daniel Miehling, and Katharina Soemer
- Abstract summary: We create a labeled dataset of 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism.
The dataset includes 1,250 tweets (18%) that are antisemitic according to the International Holocaust Remembrance Alliance (IHRA) definition of antisemitism.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the major challenges in automatic hate speech detection is the lack of
datasets that cover a wide range of biased and unbiased messages and that are
consistently labeled. We propose a labeling procedure that addresses some of
the common weaknesses of labeled datasets. We focus on antisemitic speech on
Twitter and create a labeled dataset of 6,941 tweets that cover a wide range of
topics common in conversations about Jews, Israel, and antisemitism between
January 2019 and December 2021 by drawing from representative samples with
relevant keywords. Our annotation process aims to strictly apply a commonly
used definition of antisemitism by forcing annotators to specify which part of
the definition applies, and by giving them the option to personally disagree
with the definition on a case-by-case basis. Labeling tweets that call out
antisemitism, report antisemitism, or are otherwise related to antisemitism
(such as the Holocaust) but are not actually antisemitic can help reduce false
positives in automated detection. The dataset includes 1,250 tweets (18%) that
are antisemitic according to the International Holocaust Remembrance Alliance
(IHRA) definition of antisemitism. It is important to note, however, that the
dataset is not comprehensive. Many topics are still not covered, and it only
includes tweets collected from Twitter between January 2019 and December 2021.
Additionally, the dataset only includes tweets that were written in English.
Despite these limitations, we hope that this is a meaningful contribution to
improving the automated detection of antisemitic speech.
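The annotation constraint described in the abstract (a positive label must name the part of the definition that applies, and an annotator may separately record personal disagreement) can be sketched as a small record type. This is a minimal illustration, not the authors' schema: the class `Annotation`, its field names, and the helper `positive_share` are all hypothetical. The last lines only re-check the reported share of 1,250 out of 6,941 tweets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One annotator's judgment of one tweet (hypothetical schema)."""

    tweet_id: str
    antisemitic_per_ihra: bool
    ihra_section: Optional[str] = None  # which part of the definition applies
    annotator_disagrees: bool = False   # personal disagreement with the definition
    calling_out: bool = False           # tweet calls out or reports antisemitism

    def __post_init__(self) -> None:
        # A positive label must name the part of the definition it rests on.
        if self.antisemitic_per_ihra and not self.ihra_section:
            raise ValueError("a positive label must cite a part of the definition")


def positive_share(n_positive: int, n_total: int) -> float:
    """Fraction of tweets labeled antisemitic."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_positive / n_total


# Figures reported in the abstract: 1,250 of 6,941 tweets.
share = positive_share(1250, 6941)
print(f"{share:.1%}")  # prints 18.0%
```

Recording the "calling out" flag separately from the main label mirrors the abstract's point that such tweets are related to antisemitism without being antisemitic, which is what helps reduce false positives.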
Related papers
- Monitoring the evolution of antisemitic discourse on extremist social media using BERT [3.3037858066178662]
Racism and intolerance on social media contribute to a toxic online environment which may spill offline to foster hatred.
Tracking antisemitic themes and their associated terminology over time in online discussions could help monitor the sentiments of their participants.
arXiv Detail & Related papers (2024-02-06T20:34:49Z)
- What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations [62.91799637259657]
Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond?
We study this research question by probing contextualized embeddings and exploring whether this bias is encoded in the models' latent representations.
We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
arXiv Detail & Related papers (2023-11-30T18:53:13Z)
- How toxic is antisemitism? Potentials and limitations of automated toxicity scoring for antisemitic online content [0.0]
Perspective API is a text toxicity assessment service by Google and Jigsaw.
We show how toxic antisemitic texts are rated and how the toxicity scores differ regarding different subforms of antisemitism.
We show that, on a basic level, Perspective API recognizes antisemitic content as toxic, but shows critical weaknesses with respect to non-explicit forms of antisemitism.
arXiv Detail & Related papers (2023-10-05T15:23:04Z)
- Russo-Ukrainian War: Prediction and explanation of Twitter suspension [47.61306219245444]
This study focuses on the Twitter suspension mechanism and analyzes the shared content and user account features that may lead to suspension.
We have obtained a dataset containing 107.7M tweets, originating from 9.8 million users, using the Twitter API.
Our results reveal scam campaigns taking advantage of trending topics regarding the Russo-Ukrainian conflict for Bitcoin fraud, spam, and advertisement campaigns.
arXiv Detail & Related papers (2023-06-06T08:41:02Z)
- Codes, Patterns and Shapes of Contemporary Online Antisemitism and Conspiracy Narratives -- an Annotation Guide and Labeled German-Language Dataset in the Context of COVID-19 [0.0]
Antisemitic and conspiracy theory content on the Internet makes data-driven algorithmic approaches essential.
We develop an annotation guide for antisemitic and conspiracy theory online content in the context of the COVID-19 pandemic.
We provide working definitions, including specific forms of antisemitism such as encoded and post-Holocaust antisemitism.
arXiv Detail & Related papers (2022-10-13T10:32:39Z)
- Predicting Hate Intensity of Twitter Conversation Threads [26.190359413890537]
We propose DRAGNET++, which aims to predict the intensity of hatred that a tweet can bring in through its reply chain in the future.
It uses the semantic and propagation structure of tweet threads to maximize the contextual information about the rise and fall of hate intensity at each subsequent tweet.
We show that DRAGNET++ outperforms all the state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-06-16T18:51:36Z)
- Twitter Dataset on the Russo-Ukrainian War [68.713984286035]
We have initiated an ongoing dataset acquisition from the Twitter API.
The dataset has reached 57.3 million tweets, originating from 7.7 million users.
We apply an initial volume and sentiment analysis, and the dataset can support further exploratory investigation of topic analysis, hate speech, and propaganda recognition, or even reveal potentially malicious entities such as botnets.
arXiv Detail & Related papers (2022-04-07T12:33:06Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
- "Subverting the Jewtocracy": Online Antisemitism Detection Using Multimodal Deep Learning [23.048101866010445]
We present the first work in the direction of automated multimodal detection of online antisemitism.
We label two datasets with 3,102 and 3,509 social media posts from Twitter and Gab respectively.
We present a multimodal deep learning system that detects the presence of antisemitic content and its specific antisemitism category using text and images from posts.
arXiv Detail & Related papers (2021-04-13T05:22:55Z)
- Trawling for Trolling: A Dataset [56.1778095945542]
We present a dataset that models trolling as a subcategory of offensive content.
The dataset has 12,490 samples, split across 5 classes: Normal, Profanity, Trolling, Derogatory, and Hate Speech.
arXiv Detail & Related papers (2020-08-02T17:23:55Z)
- Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.