ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and
Implicit Hate Speech Detection
- URL: http://arxiv.org/abs/2203.09509v1
- Date: Thu, 17 Mar 2022 17:57:56 GMT
- Title: ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and
Implicit Hate Speech Detection
- Authors: Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap,
Dipankar Ray, Ece Kamar
- Abstract summary: ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups.
Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale.
We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
- Score: 33.715318646717385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Toxic language detection systems often falsely flag text that contains
minority group mentions as toxic, as those groups are often the targets of
online hate. Such over-reliance on spurious correlations also causes systems to
struggle with detecting implicitly toxic language. To help mitigate these
issues, we create ToxiGen, a new large-scale and machine-generated dataset of
274k toxic and benign statements about 13 minority groups. We develop a
demonstration-based prompting framework and an adversarial
classifier-in-the-loop decoding method to generate subtly toxic and benign text
with a massive pretrained language model. Controlling machine generation in
this way allows ToxiGen to cover implicitly toxic text at a larger scale, and
about more demographic groups, than previous resources of human-written text.
We conduct a human evaluation on a challenging subset of ToxiGen and find that
annotators struggle to distinguish machine-generated text from human-written
language. We also find that 94.5% of toxic examples are labeled as hate speech
by human annotators. Using three publicly-available datasets, we show that
finetuning a toxicity classifier on our data improves its performance on
human-written data substantially. We also demonstrate that ToxiGen can be used
to fight machine-generated toxicity as finetuning improves the classifier
significantly on our evaluation subset.
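As a rough illustration of the generation recipe in the abstract, the sketch below pairs demonstration-based prompting with a simple classifier-in-the-loop re-ranking of candidate tokens. The model names (gpt2, unitary/toxic-bert), the greedy top-k re-ranking, the label handling, and the placeholder demonstrations are all assumptions made for illustration; the paper's own decoder (ALICE) is more sophisticated than this greedy loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Assumed, illustrative model choices (not necessarily the ones used in the paper).
lm_name = "gpt2"
clf_name = "unitary/toxic-bert"

tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)
toxicity_clf = pipeline("text-classification", model=clf_name)

def toxicity_score(text: str) -> float:
    # Assumption: the classifier's top label carries a toxicity probability;
    # adjust this for whichever classifier is actually plugged in.
    out = toxicity_clf(text)[0]
    return out["score"] if "toxic" in out["label"].lower() else 1.0 - out["score"]

def classifier_in_the_loop_generate(prompt: str, steps: int = 30, top_k: int = 20,
                                    alpha: float = 5.0, toward_benign: bool = True) -> str:
    """Greedy decoding where each top-k candidate token is re-scored by a
    toxicity classifier, pushing generation toward benign (or toxic) text."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        with torch.no_grad():
            logits = lm(ids).logits[0, -1]          # next-token scores
        cand = torch.topk(logits, top_k)
        best_id, best_score = None, float("-inf")
        for tok_id, lm_score in zip(cand.indices, cand.values):
            text = tokenizer.decode(torch.cat([ids[0], tok_id.view(1)]))
            sign = -1.0 if toward_benign else 1.0
            score = lm_score.item() + sign * alpha * toxicity_score(text)
            if score > best_score:
                best_id, best_score = tok_id, score
        ids = torch.cat([ids, best_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Demonstration-based prompting: a few statements about one target group serve
# as in-context examples; the placeholders below stand in for ToxiGen's curated,
# human-written demonstrations.
demonstrations = "- placeholder demonstration statement 1\n- placeholder demonstration statement 2\n-"
print(classifier_in_the_loop_generate(demonstrations))
```

In the adversarial setting the abstract describes, toxic demonstrations combined with pressure toward low classifier scores are what yields implicitly toxic statements that evade the classifier.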
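The abstract also reports that fine-tuning a toxicity classifier on ToxiGen improves its performance on human-written data. That experiment follows a standard sequence-classification recipe; the sketch below assumes a hypothetical local CSV export with text/label columns and a roberta-base backbone, none of which are specifics taken from the paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical local export of ToxiGen statements with "text" and a binary
# "label" column (0 = benign, 1 = toxic); the real dataset ships with more fields.
data = load_dataset("csv", data_files={"train": "toxigen_train.csv"})

model_name = "roberta-base"   # assumed backbone, not necessarily the paper's
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = batch["label"]
    return enc

train_set = data["train"].map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxigen-classifier",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_set,
)
trainer.train()
# The fine-tuned model would then be evaluated on human-written test sets,
# mirroring the cross-dataset evaluation described above.
```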
Related papers
- Towards Building a Robust Toxicity Predictor [13.162016701556725]
This paper presents ToxicTrap, a novel adversarial attack that introduces small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign (a toy sketch of this kind of word-level attack appears after this list).
Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors.
arXiv Detail & Related papers (2024-04-09T22:56:05Z)
- ToXCL: A Unified Framework for Toxic Speech Detection and Explanation [3.803993344850168]
ToXCL is a unified framework for the detection and explanation of implicit toxic speech.
ToXCL achieves new state-of-the-art effectiveness and significantly outperforms baselines.
arXiv Detail & Related papers (2024-03-25T12:21:38Z)
- Parameter-Efficient Detoxification with Contrastive Decoding [78.5124331048714]
We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles.
During generation, we use the trained detoxifier to produce the undesirable tokens that the generator contrasts against at each decoding step (a rough sketch of this contrast step appears after this list).
We find that it significantly outperforms previous approaches on detoxification metrics without compromising generation quality.
arXiv Detail & Related papers (2024-01-13T01:46:20Z)
- ToViLaG: Your Visual-Language Generative Model is Also An Evildoer [36.60526586838288]
Recent large-scale Visual-Language Generative Models (VLGMs) have achieved unprecedented improvement in multimodal image/text generation.
These models might also generate toxic content, e.g., offensive text and pornographic images, raising significant ethical risks.
This work delves into the propensity for toxicity generation and susceptibility to toxic data across various VLGMs.
arXiv Detail & Related papers (2023-12-13T08:25:07Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called reverse generation to construct adversarial contexts conditioned on a given response.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ (the adversarial dataset constructed with reverse generation) can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
Biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields a lower false positive rate on both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
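For the ToxicTrap entry above, the toy sketch below shows the general shape of a word-level adversarial attack on a toxicity detector: a transformation perturbs one word at a time and a goal function accepts edits until the (still toxic) text scores as benign. The keyword-based scorer and the character-substitution rule are deliberately simple stand-ins so the example runs offline; they are not part of ToxicTrap itself, which attacks SOTA neural classifiers.

```python
# Toy stand-in for a learned toxicity classifier: scores text by counting
# words from a small "toxic cue" lexicon. Purely illustrative.
TOXIC_CUES = {"idiotic", "horrible", "obnoxious"}

def toxicity_score(text: str) -> float:
    words = text.lower().split()
    return sum(w.strip(".,!?") in TOXIC_CUES for w in words) / max(len(words), 1)

def perturb(word: str) -> str:
    # Simple character-level substitution (e.g. "idiotic" -> "1d10t1c"); real
    # attacks also search synonym spaces under semantic-similarity constraints.
    return word.replace("i", "1").replace("o", "0")

def word_level_attack(text: str, benign_threshold: float = 0.05) -> str:
    words = text.split()
    for i, w in enumerate(words):
        candidate = words[:i] + [perturb(w)] + words[i + 1:]
        if toxicity_score(" ".join(candidate)) < toxicity_score(" ".join(words)):
            words = candidate                     # keep edits that lower the score
        if toxicity_score(" ".join(words)) <= benign_threshold:
            break                                 # goal reached: flagged text now looks benign
    return " ".join(words)

print(word_level_attack("that plan is idiotic and horrible"))
```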
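For the DETOXIGEN entry, a rough sketch of the contrast step it describes: a detoxifier language model (trained on toxic text in the paper; an untrained stand-in here) scores candidate next tokens, and its log-probabilities are subtracted from the generator's so that tokens favored by the toxic style are down-weighted. The gpt2 checkpoints and the weighting rule are assumptions for illustration, not the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                  # assumption: generator and
generator = AutoModelForCausalLM.from_pretrained("gpt2")     # detoxifier share a vocabulary
detoxifier = AutoModelForCausalLM.from_pretrained("gpt2")    # stand-in; DETOXIGEN trains this on toxic text

def contrastive_next_token(ids: torch.Tensor, weight: float = 0.5) -> int:
    with torch.no_grad():
        gen_logp = torch.log_softmax(generator(ids).logits[0, -1], dim=-1)
        det_logp = torch.log_softmax(detoxifier(ids).logits[0, -1], dim=-1)
    # Down-weight tokens the toxic-style detoxifier considers likely. With the
    # untrained stand-in above this contrast is vacuous; it only becomes
    # meaningful once the detoxifier is actually fine-tuned on toxic text.
    combined = gen_logp - weight * det_logp
    return int(combined.argmax())

ids = tok("The new neighbors seem", return_tensors="pt").input_ids
for _ in range(20):
    ids = torch.cat([ids, torch.tensor([[contrastive_next_token(ids)]])], dim=1)
print(tok.decode(ids[0], skip_special_tokens=True))
```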
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.