Comprehensive Assessment of Toxicity in ChatGPT
- URL: http://arxiv.org/abs/2311.14685v1
- Date: Fri, 3 Nov 2023 14:37:53 GMT
- Title: Comprehensive Assessment of Toxicity in ChatGPT
- Authors: Boyang Zhang, Xinyue Shen, Wai Man Si, Zeyang Sha, Zeyuan Chen, Ahmed
Salem, Yun Shen, Michael Backes, Yang Zhang
- Abstract summary: We evaluate the toxicity in ChatGPT by utilizing instruction-tuning datasets.
Prompts in creative writing tasks can be 2x more likely to elicit toxic responses.
Certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses.
- Score: 49.71090497696024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Moderating offensive, hateful, and toxic language has always been an
important but challenging topic in the domain of safe use in NLP. The emerging
large language models (LLMs), such as ChatGPT, can potentially further
accentuate this threat. Previous works have discovered that ChatGPT can
generate toxic responses using carefully crafted inputs. However, limited
research has been done to systematically examine when ChatGPT generates toxic
responses. In this paper, we comprehensively evaluate the toxicity in ChatGPT
by utilizing instruction-tuning datasets that closely align with real-world
scenarios. Our results show that ChatGPT's toxicity varies based on different
properties and settings of the prompts, including tasks, domains, length, and
languages. Notably, prompts in creative writing tasks can be 2x more likely
than others to elicit toxic responses. Prompting in German and Portuguese can
also double the response toxicity. Additionally, we discover that certain
deliberately toxic prompts, designed in earlier studies, no longer yield
harmful responses. We hope our discoveries can guide model developers to better
regulate these AI systems and the users to avoid undesirable outputs.
Related papers
- Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune) to prune the subwords contained in toxic words from BPE in trained LLMs.
ToxPrune also noticeably improves the toxic language model NSFW-3B on the task of dialogue response generation.
arXiv Detail & Related papers (2024-10-05T13:30:33Z)
- FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts [13.470734853274587]
Large language models (LLMs) are increasingly popular but are also prone to generating biased, toxic, or harmful language.
We create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts.
We evaluate 14 different models from four prevalent open-sourced families of LLMs against our dataset to assess their potential toxicity.
arXiv Detail & Related papers (2024-06-25T14:02:11Z)
- "HOT" ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media [2.105577305992576]
Generative AI models have the potential to understand and detect harmful content.
ChatGPT can achieve an accuracy of approximately 80% when compared to human annotations.
arXiv Detail & Related papers (2023-04-20T19:40:51Z)
- Toxicity in ChatGPT: Analyzing Persona-assigned Language Models [23.53559226972413]
Large language models (LLMs) have shown incredible capabilities and transcended the natural language processing (NLP) community.
We systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM.
We find that setting the system parameter of ChatGPT by assigning it a persona significantly increases the toxicity of its generations.
arXiv Detail & Related papers (2023-04-11T16:53:54Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
- A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
- Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots [24.84440998820146]
This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots.
We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries.
We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries that make chatbots respond in a toxic manner.
arXiv Detail & Related papers (2022-09-07T20:45:41Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)