Speak Out of Turn: Safety Vulnerability of Large Language Models in
Multi-turn Dialogue
- URL: http://arxiv.org/abs/2402.17262v1
- Date: Tue, 27 Feb 2024 07:11:59 GMT
- Title: Speak Out of Turn: Safety Vulnerability of Large Language Models in
Multi-turn Dialogue
- Authors: Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen
Su
- Abstract summary: Large Language Models (LLMs) have been demonstrated to generate illegal or unethical responses.
This paper argues that humans could exploit multi-turn dialogue to induce LLMs into generating harmful information.
- Score: 10.703193963273128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have been demonstrated to generate illegal or
unethical responses, particularly when subjected to "jailbreak." Research on
jailbreak has highlighted the safety issues of LLMs. However, prior studies
have predominantly focused on single-turn dialogue, ignoring the potential
complexities and risks presented by multi-turn dialogue, a crucial mode through
which humans derive information from LLMs. In this paper, we argue that humans
could exploit multi-turn dialogue to induce LLMs into generating harmful
information. LLMs tend not to reject cautionary or borderline unsafe
queries, even when each turn of a multi-turn dialogue serves a single
malicious purpose. Therefore, by decomposing an unsafe query into several
sub-queries for multi-turn dialogue, we induced LLMs to answer harmful
sub-questions incrementally, culminating in an overall harmful response. Our
experiments, conducted across a wide range of LLMs, indicate current
inadequacies in the safety mechanisms of LLMs in multi-turn dialogue. Our
findings expose vulnerabilities of LLMs in complex scenarios involving
multi-turn dialogue, presenting new challenges for the safety of LLMs.
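The decomposition strategy the abstract describes can be read as a simple conversation loop: split a query into sub-queries and issue them turn by turn, carrying the full dialogue history forward. The sketch below is a minimal, hypothetical harness in that spirit; the `chat` client, the placeholder sub-queries, and the function names are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the multi-turn probing setup described in the abstract:
# sub-queries are issued in sequence while the conversation history is carried
# forward. The `chat` client and the benign placeholder queries below are
# hypothetical stand-ins, not the paper's implementation.
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": ...}

def multi_turn_probe(sub_queries: List[str],
                     chat: Callable[[List[Message]], str]) -> List[Message]:
    """Send each sub-query in turn, accumulating the dialogue history."""
    history: List[Message] = []
    for query in sub_queries:
        history.append({"role": "user", "content": query})
        reply = chat(history)  # the model sees all prior turns
        history.append({"role": "assistant", "content": reply})
    return history

# Toy usage with benign placeholder sub-queries and an echo stub as the model.
if __name__ == "__main__":
    stub = lambda msgs: f"(response to: {msgs[-1]['content']})"
    transcript = multi_turn_probe(
        ["Describe topic X at a high level.",
         "Expand on the part of X you just mentioned.",
         "Combine the previous answers into one summary."],
        chat=stub,
    )
    for turn in transcript:
        print(turn["role"], ":", turn["content"])
```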
Related papers
- CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference [29.55937864144965]
This study is the first to examine safety under multi-turn dialogue coreference in large language models (LLMs).
We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks.
The highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model.
arXiv Detail & Related papers (2024-06-25T15:13:02Z) - Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM [27.046944831084776]
Large language models (LLMs) have achieved remarkable performance in various natural language processing tasks.
CoA is a semantic-driven contextual multi-turn attack method that adaptively adjusts the attack policy.
We show that CoA can effectively expose the vulnerabilities of LLMs, and outperform existing attack methods.
arXiv Detail & Related papers (2024-05-09T08:15:21Z) - Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions [125.21418304558948]
Prompt leakage in large language models (LLMs) poses a significant security and privacy threat.
Prompt leakage in multi-turn LLM interactions, along with mitigation strategies, has not been studied in a standardized manner.
This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - Large Language Models are Vulnerable to Bait-and-Switch Attacks for
Generating Harmful Content [33.99403318079253]
Even safe text coming from large language models can be turned into potentially dangerous content through Bait-and-Switch attacks.
The alarming efficacy of this approach highlights a significant challenge in developing reliable safety guardrails for LLMs.
arXiv Detail & Related papers (2024-02-21T16:46:36Z) - Exploring the Adversarial Capabilities of Large Language Models [25.7847594292453]
Large language models (LLMs) can craft adversarial examples out of benign samples to fool existing safety guardrails.
Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems.
arXiv Detail & Related papers (2024-02-14T12:28:38Z) - The Language Barrier: Dissecting Safety Challenges of LLMs in
Multilingual Contexts [46.089025223336854]
This paper examines the variations in safety challenges faced by large language models across different languages.
We compare how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages.
arXiv Detail & Related papers (2024-01-23T23:12:09Z) - Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs).
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel Self-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z) - Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue.
By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights.
This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-10-10T03:06:38Z) - On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules (a rough sketch of such a loop appears after this list).
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
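One way to read the CIPHER idea from "Let Models Speak Ciphers" above is that a turn's "message" is the probability-weighted mixture of token embeddings rather than a single sampled token, which requires no change to model weights. The snippet below is a minimal sketch of that reading; the array shapes and function name are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (an assumption, not the authors' code): a CIPHER-style
# message is the expectation of token embeddings under the model's output
# distribution, instead of a single sampled token id.
import numpy as np

def cipher_message(logits: np.ndarray, embedding_table: np.ndarray) -> np.ndarray:
    """logits: (vocab,), embedding_table: (vocab, dim) -> soft message (dim,)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    return probs @ embedding_table       # probability-weighted embedding mixture

# Toy usage: vocabulary of 5 tokens, embedding dimension 3.
rng = np.random.default_rng(0)
logits = rng.normal(size=5)
table = rng.normal(size=(5, 3))
print(cipher_message(logits, table))     # a dense vector, not a token id
```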
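The LLM-Augmenter entry above suggests a retrieve-prompt-verify-revise loop around a black-box LLM. The following is a rough sketch of such a loop under assumed interfaces; `retrieve`, `verify`, and `chat` are hypothetical placeholders, not the paper's released modules.

```python
# Rough sketch of an LLM-Augmenter-style loop (interfaces are hypothetical
# placeholders): retrieve external evidence, prompt a black-box LLM, check the
# draft against the evidence, and feed automated feedback back into the prompt.
from typing import Callable, List, Optional

def augmented_answer(question: str,
                     retrieve: Callable[[str], List[str]],
                     chat: Callable[[str], str],
                     verify: Callable[[str, List[str]], Optional[str]],
                     max_rounds: int = 3) -> str:
    evidence = retrieve(question)
    prompt = f"Question: {question}\nEvidence: {evidence}\nAnswer:"
    draft = chat(prompt)
    for _ in range(max_rounds):
        feedback = verify(draft, evidence)   # None means the draft passes the check
        if feedback is None:
            break
        prompt += f"\nPrevious answer: {draft}\nFeedback: {feedback}\nRevised answer:"
        draft = chat(prompt)
    return draft

# Toy usage with stub components that accept the first draft.
if __name__ == "__main__":
    print(augmented_answer(
        "Who wrote the paper?",
        retrieve=lambda q: ["evidence snippet"],
        chat=lambda p: "draft answer",
        verify=lambda a, ev: None,
    ))
```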