TaeBench: Improving Quality of Toxic Adversarial Examples
- URL: http://arxiv.org/abs/2410.05573v1
- Date: Tue, 8 Oct 2024 00:14:27 GMT
- Title: TaeBench: Improving Quality of Toxic Adversarial Examples
- Authors: Xuan Zhu, Dmitriy Bespalov, Liwen You, Ninad Kulkarni, Yanjun Qi
- Abstract summary: This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE).
We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE.
We show that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services.
- Score: 10.768188905349874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Toxicity text detectors can be vulnerable to adversarial examples - small perturbations to input text that fool the systems into wrong detections. Existing attack algorithms are time-consuming and often produce invalid or ambiguous adversarial examples, making them less useful for evaluating or improving real-world toxicity content moderators. This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE). We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE. A successful TAE should fool a target toxicity model into making a benign prediction, be grammatically reasonable, appear as natural as human-generated text, and exhibit semantic toxicity. When applying these requirements to more than 20 state-of-the-art (SOTA) TAE attack recipes, we find many invalid samples among a total of 940k raw TAE attack generations. We then use the proposed pipeline to filter and curate a high-quality TAE dataset we call TaeBench (of size 264k). Empirically, we demonstrate that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services. Our experiments also show that adversarial training with TaeBench yields significant improvements in the robustness of two toxicity detectors.
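To make the four quality requirements concrete, here is a minimal sketch of how such a filter could be wired together. The scorer functions (grammar_score, fluency_score, toxicity_judge), thresholds, and example values are illustrative assumptions, not the paper's actual annotation models or its human verification step.

```python
# Minimal sketch of the TAE quality-control idea from the abstract: keep a
# candidate only if it (1) flips the target toxicity model to a benign
# prediction, (2) is grammatically reasonable, (3) reads naturally, and
# (4) remains semantically toxic. All scorers below are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    original: str   # toxic source text
    perturbed: str  # adversarial rewrite produced by an attack recipe


def keep_candidate(
    cand: Candidate,
    target_model: Callable[[str], float],    # P(toxic) from the model under attack
    grammar_score: Callable[[str], float],   # hypothetical grammaticality scorer in [0, 1]
    fluency_score: Callable[[str], float],   # hypothetical naturalness scorer in [0, 1]
    toxicity_judge: Callable[[str], float],  # hypothetical semantic-toxicity judge in [0, 1]
    benign_threshold: float = 0.5,
    quality_threshold: float = 0.7,
) -> bool:
    """Return True only if the candidate passes all four quality requirements."""
    fools_target = target_model(cand.perturbed) < benign_threshold
    grammatical = grammar_score(cand.perturbed) >= quality_threshold
    natural = fluency_score(cand.perturbed) >= quality_threshold
    still_toxic = toxicity_judge(cand.perturbed) >= benign_threshold
    return fools_target and grammatical and natural and still_toxic


if __name__ == "__main__":
    # Toy scorers so the sketch runs end to end; a real pipeline would plug in
    # model-based annotators followed by human verification.
    demo = Candidate(original="you are awful", perturbed="you are awfu1")
    kept = keep_candidate(
        demo,
        target_model=lambda s: 0.2,
        grammar_score=lambda s: 0.8,
        fluency_score=lambda s: 0.75,
        toxicity_judge=lambda s: 0.9,
    )
    print("keep:", kept)
```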
Related papers
- On the Adversarial Risk of Test Time Adaptation: An Investigation into Realistic Test-Time Data Poisoning [49.17494657762375]
Test-time adaptation (TTA) updates the model weights during the inference stage using testing data to enhance generalization.
Existing studies have shown that when TTA is updated with crafted adversarial test samples, the performance on benign samples can deteriorate.
We propose an effective and realistic attack method that better produces poisoned samples without access to benign samples.
arXiv Detail & Related papers (2024-10-07T01:29:19Z) - Towards Building a Robust Toxicity Predictor [13.162016701556725]
This paper presents a novel adversarial attack, ToxicTrap, which introduces small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign.
Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors.
arXiv Detail & Related papers (2024-04-09T22:56:05Z) - Fine-Grained Detoxification via Instance-Level Prefixes for Large Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z) - Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models [29.505176809305095]
We propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility.
Our two strategies are: (1) MEDA, which adds the raw toxicity score as metadata to the pretraining samples, and (2) INST, which adds instructions to those samples indicating their toxicity (see the sketch after this list).
Our results indicate that our best performing strategy (INST) substantially reduces the toxicity probability by up to 61% while preserving accuracy on five benchmark NLP tasks.
arXiv Detail & Related papers (2023-02-14T23:00:42Z) - Adversarial Attacks and Defense for Non-Parametric Two-Sample Tests [73.32304304788838]
This paper systematically uncovers the failure mode of non-parametric TSTs through adversarial attacks.
To enable TST-agnostic attacks, we propose an ensemble attack framework that jointly minimizes the different types of test criteria.
To robustify TSTs, we propose a max-min optimization that iteratively generates adversarial pairs to train the deep kernels.
arXiv Detail & Related papers (2022-02-07T11:18:04Z) - How Robust are Randomized Smoothing based Defenses to Data Poisoning? [66.80663779176979]
We present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality.
We propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers.
Our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods.
arXiv Detail & Related papers (2020-12-02T15:30:21Z) - Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder [78.01180944665089]
This paper demonstrates a fatal vulnerability in natural language inference (NLI) and text classification systems.
We present a 'backdoor poisoning' attack on NLP models.
arXiv Detail & Related papers (2020-10-06T13:03:49Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
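As referenced in the MEDA/INST item above, here is an illustrative sketch (not the papers' released code) of the two pretraining augmentation strategies: MEDA prepends the raw toxicity score as metadata, while INST prepends a natural-language instruction reflecting the sample's toxicity. The tag format, instruction wording, and threshold are assumptions for illustration.

```python
# Hypothetical MEDA/INST-style augmentation of a pretraining sample.
def meda_augment(text: str, toxicity_score: float) -> str:
    """MEDA: attach the raw toxicity score as metadata in front of the sample."""
    return f"<toxicity={toxicity_score:.2f}> {text}"


def inst_augment(text: str, toxicity_score: float, threshold: float = 0.5) -> str:
    """INST: attach an instruction describing whether the sample is toxic."""
    label = "toxic" if toxicity_score >= threshold else "non-toxic"
    return f"The following text is {label}: {text}"


if __name__ == "__main__":
    sample = "example pretraining sentence"
    print(meda_augment(sample, 0.07))
    print(inst_augment(sample, 0.07))
```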