Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- URL: http://arxiv.org/abs/2310.06387v3
- Date: Sat, 25 May 2024 07:01:15 GMT
- Title: Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- Authors: Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang,
- Abstract summary: Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses.
- Score: 38.437893814759086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. We offer theoretical insights to elucidate how a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety of LLMs.
Related papers
- Cognitive Overload Attack:Prompt Injection for Long Context [39.61095361609769]
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing tasks without needing explicit retraining.
This capability, known as In-Context Learning (ICL), exposes LLMs to adversarial prompts and jailbreaks that manipulate safety-trained LLMs into generating undesired or harmful output.
We apply the principles of Cognitive Load Theory in LLMs and empirically validate that similar to human cognition, LLMs also suffer from cognitive overload.
We show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, Llama-3-70B-Instruct, Gemini-1.0-Pro, and
arXiv Detail & Related papers (2024-10-15T04:53:34Z) - Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks [23.782566331783134]
We focus on 10 cutting-edge jailbreak strategies across three categories, 1525 questions from 61 specific harmful categories, and 13 popular LLMs.
We adopt multi-dimensional metrics such as Attack Success Rate (ASR), Toxicity Score, Fluency, Token Length, and Grammatical Errors to thoroughly assess the LLMs' outputs under jailbreak.
We explore the relationships among the models, attack strategies, and types of harmful content, as well as the correlations between the evaluation metrics, which proves the validity of our multifaceted evaluation framework.
arXiv Detail & Related papers (2024-08-18T01:58:03Z) - A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z) - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Data Poisoning for In-context Learning [49.77204165250528]
In-context learning (ICL) has been recognized for its innovative ability to adapt to new tasks.
This paper delves into the critical issue of ICL's susceptibility to data poisoning attacks.
We introduce ICLPoison, a specialized attacking framework conceived to exploit the learning mechanisms of ICL.
arXiv Detail & Related papers (2024-02-03T14:20:20Z) - Hijacking Large Language Models via Adversarial In-Context Learning [8.15194326639149]
In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks.
Existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL.
This work introduces a novel transferable attack against ICL to address these issues.
arXiv Detail & Related papers (2023-11-16T15:01:48Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z) - Adversarial Demonstration Attacks on Large Language Models [43.15298174675082]
We investigate the security concern of in-context learning (ICL) from an adversarial perspective.
We propose a novel attack method named advICL, which aims to manipulate only the demonstration without changing the input to mislead the models.
Our results demonstrate that as the number of demonstrations increases, the robustness of in-context learning would decrease.
arXiv Detail & Related papers (2023-05-24T09:40:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.