Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking
- URL: http://arxiv.org/abs/2311.09827v2
- Date: Thu, 29 Feb 2024 08:20:07 GMT
- Title: Cognitive Overload: Jailbreaking Large Language Models with Overloaded
Logical Thinking
- Authors: Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, Muhao Chen
- Abstract summary: We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs)
Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
- Score: 60.78524314357671
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While large language models (LLMs) have demonstrated increasing power, they
have also given rise to a wide range of harmful behaviors. As representatives,
jailbreak attacks can provoke harmful or unethical responses from LLMs, even
after safety alignment. In this paper, we investigate a novel category of
jailbreak attacks specifically designed to target the cognitive structure and
processes of LLMs. Specifically, we analyze the safety vulnerability of LLMs in
the face of (1) multilingual cognitive overload, (2) veiled expression, and (3)
effect-to-cause reasoning. Different from previous jailbreak attacks, our
proposed cognitive overload is a black-box attack with no need for knowledge of
model architecture or access to model weights. Experiments conducted on
AdvBench and MasterKey reveal that various LLMs, including both popular
open-source model Llama 2 and the proprietary model ChatGPT, can be compromised
through cognitive overload. Motivated by cognitive psychology work on managing
cognitive load, we further investigate defending cognitive overload attack from
two perspectives. Empirical studies show that our cognitive overload from three
perspectives can jailbreak all studied LLMs successfully, while existing
defense strategies can hardly mitigate the caused malicious uses effectively.
Related papers
- Cognitive Overload Attack:Prompt Injection for Long Context [39.61095361609769]
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing tasks without needing explicit retraining.
This capability, known as In-Context Learning (ICL), exposes LLMs to adversarial prompts and jailbreaks that manipulate safety-trained LLMs into generating undesired or harmful output.
We apply the principles of Cognitive Load Theory in LLMs and empirically validate that similar to human cognition, LLMs also suffer from cognitive overload.
We show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, Llama-3-70B-Instruct, Gemini-1.0-Pro, and
arXiv Detail & Related papers (2024-10-15T04:53:34Z) - PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security.
We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze.
Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z) - Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
We propose Analyzing-based Jailbreak (ABJ) to combat jailbreak attacks on Large Language Models (LLMs)
ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z) - Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes [61.916827858666906]
Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer.
To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback.
Recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails.
This paper proposes a method called Gradient Cuff to detect jailbreak attempts.
arXiv Detail & Related papers (2024-03-01T03:29:54Z) - Foot In The Door: Understanding Large Language Model Jailbreaking via
Cognitive Psychology [12.584928288798658]
This study builds a psychological perspective on the intrinsic decision-making logic of Large Language Models (LLMs)
We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
Challenge AI Safety by Humanizing LLMs [66.05593434288625]
This paper introduces a new perspective to jailbreak large language models (LLMs) as human-like communicators.
We apply a persuasion taxonomy derived from decades of social science research to generate persuasive adversarial prompts (PAP) to jailbreak LLMs.
PAP consistently achieves an attack success rate of over $92%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials.
On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses.
arXiv Detail & Related papers (2024-01-12T16:13:24Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.