Related papers: Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

Dr. Jekyll and Mr. Hyde: Two Faces of LLMs

URL: http://arxiv.org/abs/2312.03853v4
Date: Thu, 25 Jul 2024 17:54:12 GMT
Title: Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
Authors: Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek,
Abstract summary: This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Gemini. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities.
Score: 23.428082923794708
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Gemini (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then follow a role-play style to elicit prohibited responses. Using personas, we show that prohibited responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Gemini. We also introduce several ways of activating such adversarial personas, which show that both chatbots are vulnerable to this kind of attack. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.

Related papers

ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour [11.86454511458083]
Large Language Models (LLMs) can be manipulative when explicitly instructed.<n>Small fine-tuned open source models, such as BERT+BiLSTM have a performance comparable to zero-shot classification.<n>Our work highlights the need of addressing manipulation risks as LLMs are increasingly deployed in consumer-facing applications.
arXiv Detail & Related papers (2025-06-11T14:29:43Z)
AI Mimicry and Human Dignity: Chatbot Use as a Violation of Self-Respect [0.04260910081285213]
We argue that interacting with chatbots in this way is incompatible with the dignity of users. We show that, since second-personal respect is premised on reciprocal recognition of second-personal authority, behaving towards chatbots in ways that convey second-personal respect is bound to misfire.
arXiv Detail & Related papers (2025-02-17T19:02:12Z)
RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts [6.0385743836962025]
RICoTA is a Korean red teaming dataset that consists of 609 prompts challenging large language models (LLMs) We utilize user-chatbot conversations that were self-posted on a Korean Reddit-like community. Our dataset will be made publicly available via GitHub.
arXiv Detail & Related papers (2025-01-29T15:32:27Z)
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards [93.16294577018482]
Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models. We show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes. Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95%$ accuracy; and then, the attacker can use this information to consistently vote against a target model.
arXiv Detail & Related papers (2025-01-13T17:12:38Z)
First-Person Fairness in Chatbots [13.787745105316043]
We study "first-person fairness," which means fairness toward the user. This includes providing high-quality responses to all users regardless of their identity or background. We propose a scalable, privacy-preserving method for evaluating one aspect of first-person fairness.
arXiv Detail & Related papers (2024-10-16T17:59:47Z)
LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
We propose a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction. Our method can simulate human-chatbot dialogues with a high indistinguishability rate.
arXiv Detail & Related papers (2024-07-04T14:49:46Z)
WildChat: 1M ChatGPT Interaction Logs in the Wild [88.05964311416717]
WildChat is a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses.
arXiv Detail & Related papers (2024-05-02T17:00:02Z)
AbuseGPT: Abuse of Generative AI ChatBots to Create Smishing Campaigns [0.0]
We propose AbuseGPT method to show how the existing generative AI-based chatbots can be exploited by attackers in real world to create smishing texts. We have found strong empirical evidences to show that attackers can exploit ethical standards in the existing generative AI-based chatbots services. We also discuss some future research directions and guidelines to protect the abuse of generative AI-based services.
arXiv Detail & Related papers (2024-02-15T05:49:22Z)
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs [72.38330196290119]
System-prompting is a tool for customizing language-model chatbots, enabling them to follow a specific instruction. We propose a benchmark to test the assumption, evaluating instruction stability via self-chats. We reveal a significant instruction drift within eight rounds of conversations. We propose a lightweight method called split-softmax, which compares favorably against two strong baselines.
arXiv Detail & Related papers (2024-02-13T20:10:29Z)
Critical Role of Artificially Intelligent Conversational Chatbot [0.0]
We explore scenarios involving ChatGPT's ethical implications within academic contexts. We propose architectural solutions aimed at preventing inappropriate use and promoting responsible AI interactions.
arXiv Detail & Related papers (2023-10-31T14:08:07Z)
Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots [8.763670548363443]
A new attack, toxicbot, is developed to generate toxic responses in a multi-turn conversation. toxicbot can be used by both industry and researchers to develop methods for detecting and mitigating toxic responses in conversational dialogue.
arXiv Detail & Related papers (2023-07-14T03:58:42Z)
A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation. It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries. However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots [24.84440998820146]
This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots. We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries. We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries.
arXiv Detail & Related papers (2022-09-07T20:45:41Z)
Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent [65.68144111226626]
Chirpy Cardinal aims to be both informative and conversational. We let both the user and bot take turns driving the conversation. Chirpy Cardinal placed second out of nine bots in the Alexa Prize Socialbot Grand Challenge.
arXiv Detail & Related papers (2022-07-25T09:57:23Z)
Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions. Our framework included a guiding robot and an interlocutor model that plays the role of humans. We examined our framework using three experimental setups and evaluate the guiding robot with four different metrics to demonstrated flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.