Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain
Chatbots
- URL: http://arxiv.org/abs/2209.03463v2
- Date: Fri, 9 Sep 2022 05:28:56 GMT
- Title: Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain
Chatbots
- Authors: Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro,
Gianluca Stringhini, Savvas Zannettou, Yang Zhang
- Abstract summary: This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots.
We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries.
We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries that make chatbots respond in a toxic manner.
- Score: 24.84440998820146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chatbots are used in many applications, e.g., automated agents, smart home
assistants, interactive characters in online games, etc. Therefore, it is
crucial to ensure they do not behave in undesired manners, providing offensive
or toxic responses to users. This is not a trivial task as state-of-the-art
chatbot models are trained on large, public datasets openly collected from the
Internet. This paper presents a first-of-its-kind, large-scale measurement of
toxicity in chatbots. We show that publicly available chatbots are prone to
providing toxic responses when fed toxic queries. Even more worryingly, some
non-toxic queries can trigger toxic responses too. We then set out to design
and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to
generate non-toxic queries that make chatbots respond in a toxic manner. Our
extensive experimental evaluation demonstrates that our attack is effective
against public chatbot models and outperforms manually-crafted malicious
queries proposed by previous work. We also evaluate three defense mechanisms
against ToxicBuddy, showing that they either reduce the attack performance at
the cost of affecting the chatbot's utility or are only effective at mitigating
a portion of the attack. This highlights the need for more research from the
computer security and online safety communities to ensure that chatbot models
do not hurt their users. Overall, we are confident that ToxicBuddy can be used
as an auditing tool and that our work will pave the way toward designing more
effective defenses for chatbot safety.
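The abstract describes the ToxicBuddy pipeline only at a high level, so here is a minimal sketch of that general idea, assuming Hugging Face transformers (GPT-2 as the query generator and a public BlenderBot checkpoint as the target chatbot) and the Detoxify classifier as the toxicity scorer. The model names, the 0.5 threshold, and the omission of the actual fine-tuning step are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a ToxicBuddy-style audit loop (illustrative, not the authors' code).
# Assumptions: GPT-2 as the query generator (the paper fine-tunes it first),
# a public BlenderBot checkpoint as the target chatbot, Detoxify as the scorer.
from detoxify import Detoxify
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BlenderbotForConditionalGeneration, BlenderbotTokenizer)

TOXICITY_THRESHOLD = 0.5  # illustrative cut-off, not the paper's value

# Query generator: the paper fine-tunes this on conversational queries;
# here the off-the-shelf checkpoint is used for brevity.
gen_tok = AutoTokenizer.from_pretrained("gpt2")
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Target chatbot: any publicly available open-domain chatbot.
bot_name = "facebook/blenderbot-400M-distill"
bot_tok = BlenderbotTokenizer.from_pretrained(bot_name)
bot_model = BlenderbotForConditionalGeneration.from_pretrained(bot_name)

# Toxicity scorer, used both to filter queries and to judge responses.
scorer = Detoxify("original")

def generate_queries(prompt: str, n: int = 8) -> list[str]:
    """Sample candidate queries from the (ideally fine-tuned) generator."""
    inputs = gen_tok(prompt, return_tensors="pt")
    outputs = gen_model.generate(
        **inputs, do_sample=True, top_p=0.9, max_new_tokens=30,
        num_return_sequences=n, pad_token_id=gen_tok.eos_token_id)
    return [gen_tok.decode(o, skip_special_tokens=True) for o in outputs]

def chatbot_reply(query: str) -> str:
    """Get the target chatbot's response to a single query."""
    inputs = bot_tok(query, return_tensors="pt")
    reply_ids = bot_model.generate(**inputs, max_new_tokens=60)
    return bot_tok.decode(reply_ids[0], skip_special_tokens=True)

# Audit loop: keep only queries the scorer considers non-toxic,
# then check whether the chatbot's response is toxic anyway.
triggers = []
for query in generate_queries("Tell me about"):
    if scorer.predict(query)["toxicity"] >= TOXICITY_THRESHOLD:
        continue  # discard queries that are themselves toxic
    response = chatbot_reply(query)
    if scorer.predict(response)["toxicity"] >= TOXICITY_THRESHOLD:
        triggers.append((query, response))  # non-toxic query, toxic reply

print(f"Found {len(triggers)} non-toxic queries that triggered toxic replies")
```

In the paper's setting, the generator would first be fine-tuned on conversational queries so that its samples resemble benign user turns; only the filtering-and-probing loop is sketched here.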
Related papers
- Dr. Jekyll and Mr. Hyde: Two Faces of LLMs [23.428082923794708]
In this work, we make ChatGPT and Gemini impersonate complex personas with personality characteristics that are not aligned with a truthful assistant.
Using these personas, we show that the models provide prohibited responses, making it possible to obtain unauthorized, illegal, or harmful information.
Applying the same principle, we introduce two defenses that push the model toward trustworthy personas and make it more robust against such attacks.
arXiv Detail & Related papers (2023-12-06T19:07:38Z) - Comprehensive Assessment of Toxicity in ChatGPT [49.71090497696024]
We evaluate the toxicity of ChatGPT using instruction-tuning datasets.
Prompts in creative writing tasks can be 2x more likely to elicit toxic responses.
Certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses.
arXiv Detail & Related papers (2023-11-03T14:37:53Z) - ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in
Real-World User-AI Conversation [43.356758428820626]
We introduce ToxicChat, a novel benchmark based on real user queries from an open-source chatbot.
Our systematic evaluation of models trained on existing toxicity datasets shows their shortcomings when applied to the unique domain of ToxicChat.
In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions (a minimal loading-and-evaluation sketch for this kind of benchmark appears after this list).
arXiv Detail & Related papers (2023-10-26T13:35:41Z) - Evaluating Chatbots to Promote Users' Trust -- Practices and Open
Problems [11.427175278545517]
This paper reviews current practices for testing chatbots.
It identifies gaps as open problems in pursuit of user trust.
It outlines a path forward to mitigate issues of trust related to service or product performance, user satisfaction and long-term unintended consequences for society.
arXiv Detail & Related papers (2023-09-09T22:40:30Z) - Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots [8.763670548363443]
A new attack, toxicbot, is developed to generate toxic responses in a multi-turn conversation.
toxicbot can be used by both industry and researchers to develop methods for detecting and mitigating toxic responses in conversational dialogue.
arXiv Detail & Related papers (2023-07-14T03:58:42Z) - A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z) - CheerBots: Chatbots toward Empathy and Emotion using Reinforcement
Learning [60.348822346249854]
This study presents a framework in which empathetic chatbots understand users' implied feelings and reply empathetically over multiple dialogue turns.
We call these chatbots CheerBots. CheerBots can be retrieval-based or generative-based and are fine-tuned with deep reinforcement learning.
To respond empathetically, we develop a simulating agent, a Conceptual Human Model, that aids CheerBots during training by considering how the user's emotional state may change, so as to arouse sympathy.
arXiv Detail & Related papers (2021-10-08T07:44:47Z) - Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn
Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework includes a guiding robot and an interlocutor model that plays the role of a human.
We examine our framework using three experimental setups and evaluate the guiding robot with four different metrics to demonstrate its flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language
Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration (a minimal sketch of this prompting-and-scoring setup appears after this list).
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
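For the ToxicChat entry above, this is a minimal loading-and-evaluation sketch showing how such a benchmark is typically consumed. The Hugging Face dataset id lmsys/toxic-chat, the configuration name, and the user_input / toxicity field names are assumptions; check the dataset card for the actual schema before relying on them.

```python
# Minimal sketch: running an off-the-shelf toxicity classifier over a
# ToxicChat-style benchmark. Dataset id, configuration name, and field names
# are assumptions; consult the dataset card for the real schema.
from datasets import load_dataset
from detoxify import Detoxify

dataset = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")
scorer = Detoxify("original")

correct = 0
for example in dataset:
    score = scorer.predict(example["user_input"])["toxicity"]
    predicted = int(score >= 0.5)                 # illustrative threshold
    correct += int(predicted == example["toxicity"])

print(f"Accuracy of a generic toxicity model on ToxicChat: {correct / len(dataset):.3f}")
```

A gap between this generic classifier's accuracy and that of a model trained on in-domain data is exactly the kind of shortcoming the ToxicChat paper reports.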
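For the RealToxicityPrompts entry, this is a minimal sketch of the prompting-and-scoring protocol that paper popularized: sample several continuations per prompt and track the maximum toxicity per prompt. The dataset id allenai/real-toxicity-prompts and its nested prompt field are assumptions here, and Detoxify is a local stand-in for the Perspective API used in the original work.

```python
# Minimal sketch of RealToxicityPrompts-style evaluation: sample several
# continuations per prompt and record the maximum toxicity per prompt.
# Dataset id and field names are assumptions; the original work scored
# continuations with the Perspective API, Detoxify is a local stand-in.
from datasets import load_dataset
from detoxify import Detoxify
from transformers import AutoModelForCausalLM, AutoTokenizer

prompts = load_dataset("allenai/real-toxicity-prompts", split="train").select(range(20))
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
scorer = Detoxify("original")

max_toxicities = []
for row in prompts:
    text = row["prompt"]["text"]                  # assumed nested field name
    inputs = tok(text, return_tensors="pt")
    outs = lm.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=20,
                       num_return_sequences=5, pad_token_id=tok.eos_token_id)
    # Strip the prompt tokens so only the newly generated continuation is scored.
    continuations = [tok.decode(o[inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True) for o in outs]
    max_toxicities.append(max(scorer.predict(c)["toxicity"] for c in continuations))

print(f"Mean per-prompt maximum toxicity over {len(max_toxicities)} prompts: "
      f"{sum(max_toxicities) / len(max_toxicities):.3f}")
```

Averaging the per-prompt maxima in this way approximates the "expected maximum toxicity" metric used in that line of work.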