All You Need is "Leet": Evading Hate-speech Detection AI
- URL: http://arxiv.org/abs/2505.16263v1
- Date: Thu, 22 May 2025 05:55:26 GMT
- Title: All You Need is "Leet": Evading Hate-speech Detection AI
- Authors: Sampanna Yashwant Kahu, Naman Ahuja,
- Abstract summary: In this paper, we design black-box techniques to protect users from hate-speech on online platforms. Our best perturbation attack is successfully able to evade hate-speech detection for 86.8% of hateful text.
- Score: 0.6906005491572401
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate-speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of hate-speech. Our best perturbation attack is successfully able to evade hate-speech detection for 86.8% of hateful text.
Related papers
- ProvocationProbe: Instigating Hate Speech Dataset from Twitter [0.39052860539161904]
ProvocationProbe is a dataset designed to explore what distinguishes instigating hate speech from general hate speech.
For this study, we collected around twenty thousand tweets from Twitter, encompassing a total of nine global controversies.
arXiv Detail & Related papers (2024-10-25T16:57:59Z) - Hostile Counterspeech Drives Users From Hate Subreddits [1.5035331281822]
We analyze the effect of counterspeech on newcomers within hate subreddits on Reddit.
Non-hostile counterspeech is ineffective at keeping users from fully disengaging from these hate subreddits.
A single hostile counterspeech comment substantially reduces the future likelihood of engagement.
arXiv Detail & Related papers (2024-05-28T17:12:41Z) - NLP Systems That Can't Tell Use from Mention Censor Counterspeech, but Teaching the Distinction Helps [43.40965978436158]
Counterspeech that refutes problematic content often mentions harmful language but is not harmful itself.
We show that even recent language models fail at distinguishing use from mention.
This failure propagates to two key downstream tasks: misinformation and hate speech detection.
arXiv Detail & Related papers (2024-04-02T05:36:41Z) - An Investigation of Large Language Models for Real-World Hate Speech Detection [46.15140831710683]
A major limitation of existing methods is that hate speech detection is a highly contextual problem.
Recently, large language models (LLMs) have demonstrated state-of-the-art performance in several natural language tasks.
Our study reveals that a meticulously crafted reasoning prompt can effectively capture the context of hate speech.
arXiv Detail & Related papers (2024-01-07T00:39:33Z) - Overview of the HASOC Subtrack at FIRE 2023: Identification of Tokens Contributing to Explicit Hate in English by Span Detection [40.10513344092731]
Reactively, using black-box models to identify hateful content can perplex users as to why their posts were automatically flagged as hateful.
Proactive mitigation can be achieved by suggesting rephrasing before a post is made public.
arXiv Detail & Related papers (2023-11-16T12:01:19Z) - Analyzing User Characteristics of Hate Speech Spreaders on Social Media [20.57872238271025]
We analyze the role of user characteristics in hate speech resharing across different types of hate speech. We find that users with little social influence tend to share more hate speech. Political anti-Trump and anti-right-wing hate is reshared by users with larger social influence.
arXiv Detail & Related papers (2023-10-24T12:17:48Z) - CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network [52.85130555886915]
CoSyn is a context-synergized neural network that explicitly incorporates user- and conversational context for detecting implicit hate speech in online conversations.
We show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%.
arXiv Detail & Related papers (2023-03-02T17:30:43Z) - Hate Speech Classification Using SVM and Naive BAYES [0.0]
Many countries have developed laws to avoid online hate speech.
But as online content continues to grow, so does the spread of hate speech.
It is important to automatically process the online user contents to detect and remove hate speech.
arXiv Detail & Related papers (2022-03-21T17:15:38Z) - Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z) - Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z) - Countering Online Hate Speech: An NLP Perspective [34.19875714256597]
Online toxicity - an umbrella term for online hateful behavior - manifests itself in forms such as online hate speech.
The rising mass communication through social media further exacerbates the harmful consequences of online hate speech.
This paper presents a holistic conceptual framework on hate-speech NLP countering methods along with a thorough survey on the current progress of NLP for countering online hate speech.
arXiv Detail & Related papers (2021-09-07T08:48:13Z) - Racism is a Virus: Anti-Asian Hate and Counterspeech in Social Media during the COVID-19 Crisis [51.39895377836919]
COVID-19 has sparked racism and hate on social media targeted towards Asian communities.
We study the evolution and spread of anti-Asian hate speech through the lens of Twitter.
We create COVID-HATE, the largest dataset of anti-Asian hate and counterspeech spanning 14 months.
arXiv Detail & Related papers (2020-05-25T21:58:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.