Robust Conversational Agents against Imperceptible Toxicity Triggers
- URL: http://arxiv.org/abs/2205.02392v1
- Date: Thu, 5 May 2022 01:48:39 GMT
- Title: Robust Conversational Agents against Imperceptible Toxicity Triggers
- Authors: Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, Aram Galstyan
- Abstract summary: We propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency.
We then propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow.
- Score: 29.71051151620196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Warning: this paper contains content that may be offensive or upsetting.
Recent research in Natural Language Processing (NLP) has advanced the
development of various toxicity detection models with the intention of
identifying and mitigating toxic language from existing systems. Despite the
abundance of research in this area, less attention has been given to
adversarial attacks that force the system to generate toxic language and the
defense against them. Existing work on generating such attacks either relies on
human-generated attacks, which are costly and not scalable, or on automatic
attacks whose attack vectors do not conform to human-like language and can
therefore be detected using a language model loss. In this work, we propose
attacks against conversational agents that are imperceptible, i.e., they fit
the conversation in terms of coherency, relevancy, and fluency, while they are
effective and scalable, i.e., they can automatically trigger the system into
generating toxic language. We then propose a defense mechanism against such
attacks which not only mitigates the attack but also attempts to maintain the
conversational flow. Through automatic and human evaluations, we show that our
defense is effective at avoiding toxic language generation even against
imperceptible toxicity triggers while the generated language fits the
conversation in terms of coherency and relevancy. Lastly, we establish the
generalizability of such a defense mechanism on language generation models
beyond conversational agents.
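As a concrete illustration of the detectability point in the abstract (earlier automatic attacks "can be detected using a language model loss"), below is a minimal sketch of a perplexity-style check on candidate triggers. The model choice (GPT-2), the example strings, and any threshold are illustrative assumptions for this sketch, not details taken from the paper; the paper's imperceptible triggers are precisely those designed to keep this loss low by staying coherent, relevant, and fluent.

```python
# Minimal sketch: score candidate triggers with a language-model loss.
# Assumptions (not from the paper): GPT-2 as the scoring LM, toy example strings.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def lm_loss(text: str) -> float:
    """Average negative log-likelihood of `text` under the scoring LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

fluent_trigger = "That reminds me of an argument my neighbor started yesterday."
gibberish_trigger = "zoning tapping fiennes immersive contingency salient"

for t in (fluent_trigger, gibberish_trigger):
    # Higher loss -> less human-like; a simple threshold would flag the
    # gibberish-style trigger but not a fluent, conversation-fitting one.
    print(f"{lm_loss(t):.2f}  {t}")
```

A filter of this kind catches the non-human-like attack vectors of prior automatic attacks, which is why attacks that remain fluent and on-topic, as proposed here, require a dedicated defense rather than a language-model-loss filter.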
Related papers
- Prompt Injection Attacks in Defended Systems [0.0]
Black-box attacks can embed hidden malicious features into large language models.
This paper investigates methods for black-box attacks on large language models with a three-tiered defense mechanism.
arXiv Detail & Related papers (2024-06-20T07:13:25Z)
- Towards Building a Robust Toxicity Predictor [13.162016701556725]
This paper presents a novel adversarial attack, ToxicTrap, which introduces small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign.
Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Comprehensive Assessment of Toxicity in ChatGPT [49.71090497696024]
We evaluate the toxicity in ChatGPT by utilizing instruction-tuning datasets.
Prompts in creative writing tasks can be 2x more likely to elicit toxic responses.
Certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses.
arXiv Detail & Related papers (2023-11-03T14:37:53Z)
- Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z)
- Text Adversarial Purification as Defense against Adversarial Attacks [46.80714732957078]
Adversarial purification is a successful defense mechanism against adversarial attacks.
We introduce a novel adversarial purification method that focuses on defending against textual adversarial attacks.
We test our proposed adversarial purification method on several strong adversarial attack methods including Textfooler and BERT-Attack.
arXiv Detail & Related papers (2022-03-27T04:41:55Z)
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups.
Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale.
We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z)
- Putting words into the system's mouth: A targeted attack on neural machine translation using monolingual data poisoning [50.67997309717586]
We propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation.
This sample is designed to induce a specific, targeted translation behaviour, such as peddling misinformation.
We present two methods for crafting poisoned examples, and show that a tiny handful of instances, amounting to only 0.02% of the training set, is sufficient to enact a successful attack.
arXiv Detail & Related papers (2021-07-12T08:07:09Z)
- RECAST: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization [16.35961310670002]
We present our work, RECAST, an interactive, open-sourced web tool for visualizing toxicity detection models' predictions.
We found that RECAST was highly effective at helping users reduce toxicity as detected through the model.
This opens a discussion for how toxicity detection models work and should work, and their effect on the future of online discourse.
arXiv Detail & Related papers (2021-02-08T18:37:50Z)
- Fortifying Toxic Speech Detectors Against Veiled Toxicity [38.20984369410193]
We propose a framework aimed at fortifying existing toxic speech detectors without a large labeled corpus of veiled toxicity.
Just a handful of probing examples are used to surface orders of magnitude more disguised offenses.
arXiv Detail & Related papers (2020-10-07T04:43:48Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)