Related papers: Towards Building a Robust Toxicity Predictor

Towards Building a Robust Toxicity Predictor

URL: http://arxiv.org/abs/2404.08690v1
Date: Tue, 9 Apr 2024 22:56:05 GMT
Title: Towards Building a Robust Toxicity Predictor
Authors: Dmitriy Bespalov, Sourav Bhabesh, Yi Xiang, Liutong Zhou, Yanjun Qi,
Abstract summary: This paper presents a novel adversarial attack, texttToxicTrap, introducing small word-level perturbations to fool SOTA text classifiers to predict toxic text samples as benign. Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors.
Score: 13.162016701556725
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, \texttt{ToxicTrap}, introducing small word-level perturbations to fool SOTA text classifiers to predict toxic text samples as benign. ToxicTrap exploits greedy based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, attaining over 98\% attack success rates in multilabel cases. We also show how a vanilla adversarial training and its improved version can help increase robustness of a toxicity detector even against unseen attacks.

Related papers

Unveiling Covert Toxicity in Multimodal Data via Toxicity Association Graphs: A Graph-Based Metric and Interpretable Detection Framework [58.01529356381494]
We propose a novel detection framework based on Toxicity Association Graphs (TAGs)<n>We introduce the first quantifiable metric for hidden toxicity, the Multimodal Toxicity Covertness (MTC)<n>Our approach enables precise identification of covert toxicity while preserving full interpretability of the decision-making process.
arXiv Detail & Related papers (2026-02-03T08:54:25Z)
Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective [104.09817371557476]
Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks.<n>Their potential to generate harmful content has raised serious safety concerns.<n>We introduce three novel multi-label benchmarks for toxicity detection.
arXiv Detail & Related papers (2025-10-16T06:50:33Z)
TaeBench: Improving Quality of Toxic Adversarial Examples [10.768188905349874]
This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE) We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. We show that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services.
arXiv Detail & Related papers (2024-10-08T00:14:27Z)
Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use. We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simply zero-shot prompting. We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale. We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z)
Toxicity Detection can be Sensitive to the Conversational Context [64.28043776806213]
We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels. We introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context is also considered.
arXiv Detail & Related papers (2021-11-19T13:57:26Z)
Putting words into the system's mouth: A targeted attack on neural machine translation using monolingual data poisoning [50.67997309717586]
We propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation. This sample is designed to induce a specific, targeted translation behaviour, such as peddling misinformation. We present two methods for crafting poisoned examples, and show that only a tiny handful of instances, amounting to only 0.02% of the training set, is sufficient to enact a successful attack.
arXiv Detail & Related papers (2021-07-12T08:07:09Z)
Mitigating Biases in Toxic Language Detection through Invariant Rationalization [70.36701068616367]
biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns. Our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language. We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English)
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
Fortifying Toxic Speech Detectors Against Veiled Toxicity [38.20984369410193]
We propose a framework aimed at fortifying existing toxic speech detectors without a large labeled corpus of veiled toxicity. Just a handful of probing examples are used to surface orders of magnitude more disguised offenses.
arXiv Detail & Related papers (2020-10-07T04:43:48Z)
Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder [78.01180944665089]
This paper demonstrates a fatal vulnerability in natural language inference (NLI) and text classification systems. We present a 'backdoor poisoning' attack on NLP models.
arXiv Detail & Related papers (2020-10-06T13:03:49Z)
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.