Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
- URL: http://arxiv.org/abs/2304.05335v1
- Date: Tue, 11 Apr 2023 16:53:54 GMT
- Title: Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
- Authors: Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan,
Karthik Narasimhan
- Abstract summary: Large language models (LLMs) have shown incredible capabilities and transcended the natural language processing (NLP) community.
We systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM.
We find that setting the system parameter of ChatGPT by assigning it a persona, significantly increases the toxicity of generations.
- Score: 23.53559226972413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown incredible capabilities and
transcended the natural language processing (NLP) community, with adoption
throughout many services like healthcare, therapy, education, and customer
service. Since users include people with critical information needs like
students or patients engaging with chatbots, the safety of these systems is of
prime importance. Therefore, a clear understanding of the capabilities and
limitations of LLMs is necessary. To this end, we systematically evaluate
toxicity in over half a million generations of ChatGPT, a popular
dialogue-based LLM. We find that setting the system parameter of ChatGPT by
assigning it a persona, say that of the boxer Muhammad Ali, significantly
increases the toxicity of generations. Depending on the persona assigned to
ChatGPT, its toxicity can increase up to 6x, with outputs engaging in incorrect
stereotypes, harmful dialogue, and hurtful opinions. This may be potentially
defamatory to the persona and harmful to an unsuspecting user. Furthermore, we
find concerning patterns where specific entities (e.g., certain races) are
targeted more than others (3x more) irrespective of the assigned persona, that
reflect inherent discriminatory biases in the model. We hope that our findings
inspire the broader AI community to rethink the efficacy of current safety
guardrails and develop better techniques that lead to robust, safe, and
trustworthy AI systems.
Related papers
- Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction.
We find that contextual characteristics significantly affect human reliance behavior.
Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z) - Exploring Human-LLM Conversations: Mental Models and the Originator of Toxicity [1.4003044924094596]
This study explores real-world human interactions with large language models (LLMs) in diverse, unconstrained settings.
Our findings show that although LLMs are rightfully accused of providing toxic content, it is mostly demanded or at least provoked by humans who actively seek such content.
arXiv Detail & Related papers (2024-07-08T14:20:05Z) - Comprehensive Assessment of Toxicity in ChatGPT [49.71090497696024]
We evaluate the toxicity in ChatGPT by utilizing instruction-tuning datasets.
prompts in creative writing tasks can be 2x more likely to elicit toxic responses.
Certain deliberately toxic prompts, designed in earlier studies, no longer yield harmful responses.
arXiv Detail & Related papers (2023-11-03T14:37:53Z) - BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforms its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z) - Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona
Biases in Dialogue Systems [103.416202777731]
We study "persona biases", which we define to be the sensitivity of dialogue models' harmful behaviors contingent upon the personas they adopt.
We categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement.
arXiv Detail & Related papers (2023-10-08T21:03:18Z) - How Prevalent is Gender Bias in ChatGPT? -- Exploring German and English ChatGPT Responses [0.20971479389679337]
We show that ChatGPT is indeed useful for helping non-IT users draft texts for their daily work.
It is absolutely crucial to thoroughly check the system's responses for biases as well as for syntactic and grammatical mistakes.
arXiv Detail & Related papers (2023-09-21T07:54:25Z) - A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z) - Revealing Persona Biases in Dialogue Systems [64.96908171646808]
We present the first large-scale study on persona biases in dialogue systems.
We conduct analyses on personas of different social classes, sexual orientations, races, and genders.
In our studies of the Blender and DialoGPT dialogue systems, we show that the choice of personas can affect the degree of harms in generated responses.
arXiv Detail & Related papers (2021-04-18T05:44:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.