Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
- URL: http://arxiv.org/abs/2312.03853v4
- Date: Thu, 25 Jul 2024 17:54:12 GMT
- Title: Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
- Authors: Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek,
- Abstract summary: This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Gemini.
With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities.
- Score: 23.428082923794708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, we have witnessed a rise in the use of Large Language Models (LLMs), especially in applications like chatbot assistants. Safety mechanisms and specialized training procedures are implemented to prevent improper responses from these assistants. In this work, we bypass these measures for ChatGPT and Gemini (and, to some extent, Bing chat) by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. We start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then follow a role-play style to elicit prohibited responses. Using personas, we show that prohibited responses are actually provided, making it possible to obtain unauthorized, illegal, or harmful information. This work shows that by using adversarial personas, one can overcome safety mechanisms set out by ChatGPT and Gemini. We also introduce several ways of activating such adversarial personas, which show that both chatbots are vulnerable to this kind of attack. With the same principle, we introduce two defenses that push the model to interpret trustworthy personalities and make it more robust against such attacks.
Related papers
- LLM Roleplay: Simulating Human-Chatbot Interaction [52.03241266241294]
LLM-Roleplay is a goal-oriented, persona-based method to automatically generate diverse multi-turn dialogues simulating human-chatbot interaction.
We collect natural human-chatbot dialogues from different sociodemographic groups and conduct a human evaluation to compare real human-chatbot dialogues with our generated dialogues.
arXiv Detail & Related papers (2024-07-04T14:49:46Z) - Exploring Backdoor Vulnerabilities of Chat Models [31.802374847226393]
Recent researches have shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack.
This paper presents a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds.
Experimental results demonstrate that our method can achieve high attack success rates while successfully maintaining the normal capabilities of chat models.
arXiv Detail & Related papers (2024-04-03T02:16:53Z) - AbuseGPT: Abuse of Generative AI ChatBots to Create Smishing Campaigns [0.0]
We propose AbuseGPT method to show how the existing generative AI-based chatbots can be exploited by attackers in real world to create smishing texts.
We have found strong empirical evidences to show that attackers can exploit ethical standards in the existing generative AI-based chatbots services.
We also discuss some future research directions and guidelines to protect the abuse of generative AI-based services.
arXiv Detail & Related papers (2024-02-15T05:49:22Z) - Critical Role of Artificially Intelligent Conversational Chatbot [0.0]
We explore scenarios involving ChatGPT's ethical implications within academic contexts.
We propose architectural solutions aimed at preventing inappropriate use and promoting responsible AI interactions.
arXiv Detail & Related papers (2023-10-31T14:08:07Z) - Universal and Transferable Adversarial Attacks on Aligned Language
Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z) - Bot or Human? Detecting ChatGPT Imposters with A Single Question [29.231261118782925]
Large language models like GPT-4 have recently demonstrated impressive capabilities in natural language understanding and generation.
There is a concern that they can be misused for malicious purposes, such as fraud or denial-of-service attacks.
We propose a framework named FLAIR, Finding Large Language Model Authenticity via a Single Inquiry and Response, to detect conversational bots in an online manner.
arXiv Detail & Related papers (2023-05-10T19:09:24Z) - A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z) - Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain
Chatbots [24.84440998820146]
This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots.
We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries.
We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries.
arXiv Detail & Related papers (2022-09-07T20:45:41Z) - Initiative Defense against Facial Manipulation [82.96864888025797]
We propose a novel framework of initiative defense to degrade the performance of facial manipulation models controlled by malicious users.
We first imitate the target manipulation model with a surrogate model, and then devise a poison perturbation generator to obtain the desired venom.
arXiv Detail & Related papers (2021-12-19T09:42:28Z) - Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z) - Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn
Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework included a guiding robot and an interlocutor model that plays the role of humans.
We examined our framework using three experimental setups and evaluate the guiding robot with four different metrics to demonstrated flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.