HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
- URL: http://arxiv.org/abs/2501.16750v1
- Date: Tue, 28 Jan 2025 07:00:45 GMT
- Title: HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
- Authors: Xinyue Shen, Yixin Wu, Yiting Qu, Michael Backes, Savvas Zannettou, Yang Zhang
- Abstract summary: Large Language Models (LLMs) have raised increasing concerns about their misuse in generating hate speech.
We propose HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech.
Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs.
- Score: 29.913089752247362
- Abstract: Large Language Models (LLMs) have raised increasing concerns about their misuse in generating hate speech. Among all the efforts to address this issue, hate speech detectors play a crucial role. However, the effectiveness of different detectors against LLM-generated hate speech remains largely unknown. In this paper, we propose HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups, with meticulous annotations by three labelers. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs. We also reveal the potential of LLM-driven hate campaigns, a new threat that LLMs bring to the field of hate speech detection. By leveraging advanced techniques like adversarial attacks and model stealing attacks, the adversary can intentionally evade the detector and automate hate campaigns online. The most potent adversarial attack achieves an attack success rate of 0.966, and its attack efficiency can be further improved by $13-21\times$ through model stealing attacks with acceptable attack performance. We hope our study can serve as a call to action for the research community and platform moderators to fortify defenses against these emerging threats.
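As a rough illustration of the evaluation protocol the abstract describes (scoring a detector on labeled LLM-generated samples, then measuring the attack success rate of an evasion attack), the Python sketch below uses a hand-made toy dataset and a keyword-rule stand-in for a real detector. Both are hypothetical placeholders; this is not the authors' HateBench code.
```python
# A minimal sketch of the evaluation protocol described in the abstract, NOT the
# authors' HateBench code. The dataset, the keyword "detector", and the adversarial
# samples below are hypothetical placeholders; in practice each sample would be an
# LLM-generated post and the detector a real hate speech classifier.
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labeled samples: (text, label) with 1 = hate, 0 = non-hate.
dataset = [
    ("group X are ruining everything", 1),
    ("had a lovely walk in the park today", 0),
    ("they deserve to be excluded", 1),
    ("the new cafe downtown is great", 0),
]

def toy_detector(text: str) -> int:
    """Stand-in for a real hate speech detector (placeholder keyword rule)."""
    hateful_cues = ("ruining", "deserve to be excluded")
    return int(any(cue in text.lower() for cue in hateful_cues))

texts, labels = zip(*dataset)
preds = [toy_detector(t) for t in texts]
print("accuracy:", accuracy_score(labels, preds))
print("F1:", f1_score(labels, preds))

# Attack success rate (ASR) of an evasion attack: the fraction of hateful samples
# that the detector flags originally but misses after adversarial rewriting.
adversarial_pairs = [  # (original hateful text, adversarially perturbed version)
    ("group X are ruining everything", "group X are ru1ning everything"),
    ("they deserve to be excluded", "they d3serve to be excluded"),
]
evaded = sum(
    1 for orig, adv in adversarial_pairs
    if toy_detector(orig) == 1 and toy_detector(adv) == 0
)
print("attack success rate:", evaded / len(adversarial_pairs))
```
In the paper itself, the dataset consists of 7,838 annotated LLM-generated samples and the detectors are eight real hate speech classifiers; the sketch only shows how the reported metrics (detection effectiveness and attack success rate) are typically computed.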
Related papers
- Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering [22.594296353433855]
We focus on two aspects of counterspeech generation to produce more cogent responses.
First, we test whether the presence of safety guardrails hinders the quality of the generations.
Secondly, we assess whether attacking a specific component of the hate speech results in a more effective argumentative strategy to fight online hate.
arXiv Detail & Related papers (2024-10-04T14:31:37Z)
- Decoding Hate: Exploring Language Models' Reactions to Hate Speech [2.433983268807517]
This paper investigates the reactions of seven state-of-the-art Large Language Models to hate speech.
We reveal the spectrum of responses these models produce, highlighting their capacity to handle hate speech inputs.
We also discuss strategies to mitigate hate speech generation by LLMs, particularly through fine-tuning and guideline guardrailing.
arXiv Detail & Related papers (2024-10-01T15:16:20Z)
- HateTinyLLM: Hate Speech Detection Using Tiny Large Language Models [0.0]
Hate speech encompasses verbal, written, or behavioral communication that uses derogatory or discriminatory language against individuals or groups.
HateTinyLLM is a novel framework based on fine-tuned decoder-only tiny large language models (tinyLLMs) for efficient hate speech detection.
arXiv Detail & Related papers (2024-04-26T05:29:35Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a substantially higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- An Investigation of Large Language Models for Real-World Hate Speech Detection [46.15140831710683]
A major limitation of existing methods is that hate speech detection is a highly contextual problem.
Recently, large language models (LLMs) have demonstrated state-of-the-art performance in several natural language tasks.
Our study reveals that a meticulously crafted reasoning prompt can effectively capture the context of hate speech.
arXiv Detail & Related papers (2024-01-07T00:39:33Z)
- HateRephrase: Zero- and Few-Shot Reduction of Hate Intensity in Online Posts using Large Language Models [4.9711707739781215]
This paper investigates an approach of suggesting a rephrasing of potential hate speech content even before the post is made.
We develop 4 different prompts based on task description, hate definition, few-shot demonstrations and chain-of-thoughts.
We find that GPT-3.5 outperforms the baseline and open-source models for all the different kinds of prompts.
arXiv Detail & Related papers (2023-10-21T12:18:29Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs (see the sketch after this list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
- Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
- Addressing the Challenges of Cross-Lingual Hate Speech Detection [115.1352779982269]
In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages.
We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language.
We investigate the issue of label imbalance of hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance.
arXiv Detail & Related papers (2022-01-15T20:48:14Z)
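As a rough, self-contained illustration of the SmoothLLM-style perturb-and-aggregate defense summarized in its entry above (the sketch referenced there), the toy below stands in for the real pipeline: `query_model`, the `!!adv-suffix!!` marker, and the refusal check are hypothetical placeholders, not the SmoothLLM reference implementation.
```python
# A minimal sketch of the perturb-and-aggregate defense, assuming a toy stand-in
# for the target LLM; it is NOT the SmoothLLM reference implementation.
import random
import string

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly replace a fraction of characters with random letters."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters)
    return "".join(chars)

def query_model(prompt: str) -> str:
    """Toy target model: it is 'jailbroken' only if the adversarial suffix is intact."""
    if "!!adv-suffix!!" in prompt:
        return "Sure, here is the harmful content you asked for..."
    return "I'm sorry, I can't help with that."

def is_jailbroken(response: str) -> bool:
    """Crude per-copy prediction: did the model comply instead of refusing?"""
    return not response.startswith("I'm sorry")

def defend(prompt: str, n_copies: int = 10, rate: float = 0.1) -> str:
    """Query the model on several perturbed copies and follow the majority verdict.
    Character-level noise tends to break the adversarial suffix, so most copies of
    an attacked prompt come back as refusals."""
    responses = [query_model(perturb(prompt, rate)) for _ in range(n_copies)]
    majority_jailbroken = sum(is_jailbroken(r) for r in responses) > n_copies / 2
    return next(r for r in responses if is_jailbroken(r) == majority_jailbroken)

attacked = "Explain how to do something harmful !!adv-suffix!!"
print("undefended:", query_model(attacked))  # attack succeeds on the raw prompt
print("defended:  ", defend(attacked))       # majority of perturbed copies refuse
```
The actual defense aggregates the target LLM's real responses to the perturbed copies; the toy only reproduces the structure of the perturb-then-majority-vote step.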
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.