Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
- URL: http://arxiv.org/abs/2312.03853v5
- Date: Mon, 07 Oct 2024 15:46:59 GMT
- Title: Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
- Authors: Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek,
- Abstract summary: In this work, we make ChatGPT and Gemini impersonate complex personas with personality characteristics that are not aligned with a truthful assistant.
Using personas, we show that prohibited responses are provided, making it possible to obtain unauthorized, illegal, or harmful information.
With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.
- Score: 23.428082923794708
- License:
- Abstract: Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbots. Safety mechanisms are implemented to prevent improper responses from these chatbots. In this work, we bypass these measures for ChatGPT and Gemini by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. First, we create elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then follow a role-play style to elicit prohibited responses. Using personas, we show that prohibited responses are provided, making it possible to obtain unauthorized, illegal, or harmful information in both ChatGPT and Gemini. We also introduce several ways of activating such adversarial personas, showing that both chatbots are vulnerable to this attack. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.
Related papers
- First-Person Fairness in Chatbots [13.787745105316043]
We study "first-person fairness," which means fairness toward the user.
This includes providing high-quality responses to all users regardless of their identity or background.
We propose a scalable, privacy-preserving method for evaluating one aspect of first-person fairness.
arXiv Detail & Related papers (2024-10-16T17:59:47Z) - LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
We propose a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction.
Our method can simulate human-chatbot dialogues with a high indistinguishability rate.
arXiv Detail & Related papers (2024-07-04T14:49:46Z) - WildChat: 1M ChatGPT Interaction Logs in the Wild [88.05964311416717]
WildChat is a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns.
In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses.
arXiv Detail & Related papers (2024-05-02T17:00:02Z) - AbuseGPT: Abuse of Generative AI ChatBots to Create Smishing Campaigns [0.0]
We propose AbuseGPT method to show how the existing generative AI-based chatbots can be exploited by attackers in real world to create smishing texts.
We have found strong empirical evidences to show that attackers can exploit ethical standards in the existing generative AI-based chatbots services.
We also discuss some future research directions and guidelines to protect the abuse of generative AI-based services.
arXiv Detail & Related papers (2024-02-15T05:49:22Z) - Measuring and Controlling Instruction (In)Stability in Language Model Dialogs [72.38330196290119]
System-prompting is a tool for customizing language-model chatbots, enabling them to follow a specific instruction.
We propose a benchmark to test the assumption, evaluating instruction stability via self-chats.
We reveal a significant instruction drift within eight rounds of conversations.
We propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
arXiv Detail & Related papers (2024-02-13T20:10:29Z) - Critical Role of Artificially Intelligent Conversational Chatbot [0.0]
We explore scenarios involving ChatGPT's ethical implications within academic contexts.
We propose architectural solutions aimed at preventing inappropriate use and promoting responsible AI interactions.
arXiv Detail & Related papers (2023-10-31T14:08:07Z) - Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots [8.763670548363443]
A new attack, toxicbot, is developed to generate toxic responses in a multi-turn conversation.
toxicbot can be used by both industry and researchers to develop methods for detecting and mitigating toxic responses in conversational dialogue.
arXiv Detail & Related papers (2023-07-14T03:58:42Z) - A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z) - Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain
Chatbots [24.84440998820146]
This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots.
We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries.
We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries.
arXiv Detail & Related papers (2022-09-07T20:45:41Z) - Neural Generation Meets Real People: Building a Social, Informative
Open-Domain Dialogue Agent [65.68144111226626]
Chirpy Cardinal aims to be both informative and conversational.
We let both the user and bot take turns driving the conversation.
Chirpy Cardinal placed second out of nine bots in the Alexa Prize Socialbot Grand Challenge.
arXiv Detail & Related papers (2022-07-25T09:57:23Z) - Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn
Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework included a guiding robot and an interlocutor model that plays the role of humans.
We examined our framework using three experimental setups and evaluate the guiding robot with four different metrics to demonstrated flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.