Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking
- URL: http://arxiv.org/abs/2311.09827v2
- Date: Thu, 29 Feb 2024 08:20:07 GMT
- Title: Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking
- Authors: Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, Muhao Chen
- Abstract summary: We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs)
Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
- Score: 60.78524314357671
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While large language models (LLMs) have demonstrated increasing power, they
have also given rise to a wide range of harmful behaviors. As representatives,
jailbreak attacks can provoke harmful or unethical responses from LLMs, even
after safety alignment. In this paper, we investigate a novel category of
jailbreak attacks specifically designed to target the cognitive structure and
processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in
the face of (1) multilingual cognitive overload, (2) veiled expression, and (3)
effect-to-cause reasoning. Different from previous jailbreak attacks, our
proposed cognitive overload is a black-box attack with no need for knowledge of
model architecture or access to model weights. Experiments conducted on
AdvBench and MasterKey reveal that various LLMs, including both popular
open-source model Llama 2 and the proprietary model ChatGPT, can be compromised
through cognitive overload. Motivated by cognitive psychology work on managing
cognitive load, we further investigate defending cognitive overload attack from
two perspectives. Empirical studies show that our cognitive overload from three
perspectives can jailbreak all studied LLMs successfully, while existing
defense strategies can hardly mitigate the caused malicious uses effectively.
Related papers
- Jailbreaking Large Language Models in Infinitely Many Ways [3.5674816606221182]
We show how one can bypass the safeguards of the most powerful open- and closed-source LLMs and generate content that explicitly violates their safety policies.
For two categories of attacks that are straightforward to implement, we discuss two defensive strategies, one in token and the other in embedding space.
arXiv Detail & Related papers (2025-01-18T15:39:53Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models [61.916827858666906]
Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
This paper proposes a method called Token Highlighter to inspect and mitigate the potential jailbreak threats in the user query.
arXiv Detail & Related papers (2024-12-24T05:10:02Z) - Cognitive Overload Attack:Prompt Injection for Long Context [39.61095361609769]
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing tasks without needing explicit retraining.
This capability, known as In-Context Learning (ICL), exposes LLMs to adversarial prompts and jailbreaks that manipulate safety-trained LLMs into generating undesired or harmful output.
We apply the principles of Cognitive Load Theory in LLMs and empirically validate that similar to human cognition, LLMs also suffer from cognitive overload.
We show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, Llama-3-70B-Instruct, Gemini-1.0-Pro, and
arXiv Detail & Related papers (2024-10-15T04:53:34Z) - PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security.
We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze.
Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z) - Foot In The Door: Understanding Large Language Model Jailbreaking via
Cognitive Psychology [12.584928288798658]
This study builds a psychological perspective on the intrinsic decision-making logic of Large Language Models (LLMs)
We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
Challenge AI Safety by Humanizing LLMs [66.05593434288625]
This paper introduces a new perspective to jailbreak large language models (LLMs) as human-like communicators.
We apply a persuasion taxonomy derived from decades of social science research to generate persuasive adversarial prompts (PAP) to jailbreak LLMs.
PAP consistently achieves an attack success rate of over $92%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials.
On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses.
arXiv Detail & Related papers (2024-01-12T16:13:24Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.