Related papers: Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

URL: http://arxiv.org/abs/2507.22564v1
Date: Wed, 30 Jul 2025 10:40:53 GMT
Title: Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
Authors: Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu,
Abstract summary: Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks.<n>We propose CognitiveAttack, a novel framework that systematically leverages both individual and combined cognitive biases.<n> Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models.
Score: 25.210464491552735
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.

Related papers

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs [65.6660735371212]
We present textbftextscJustAsk, a framework that autonomously discovers effective extraction strategies through interaction alone.<n>It formulates extraction as an online exploration problem, using Upper Confidence Bound--based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration.<n>Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.
arXiv Detail & Related papers (2026-01-29T03:53:25Z)
Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously.<n>We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z)
Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models [33.30628603365359]
Large Language Models (LLMs) face a significant threat from multi-turn jailbreak attacks.<n>We introduce Mastermind, a multi-turn jailbreak framework that adopts a dynamic and self-improving approach.<n>We conduct comprehensive experiments against state-of-the-art models, including GPT-5 and Claude 3.7 Sonnet.
arXiv Detail & Related papers (2026-01-09T00:27:08Z)
Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents [1.014002853673217]
LLM agents are vulnerable to Indirect Prompt Injection (IPI) attacks.<n>IPI attacks hijack agent behavior by polluting external information sources.<n>We propose the Cognitive Control Architecture (CCA), a holistic framework achieving full-lifecycle cognitive supervision.
arXiv Detail & Related papers (2025-12-07T08:11:19Z)
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks [61.06621533874629]
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs)<n>In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts.<n>Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio.
arXiv Detail & Related papers (2025-07-02T03:09:20Z)
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks.<n>We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs [26.07976338566543]
We present ShadowCoT, a novel backdoor attack framework that targets internal reasoning mechanism of LLMs.<n>By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps.<n>Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations.
arXiv Detail & Related papers (2025-04-08T01:36:16Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.<n>It reformulates harmful queries into benign reasoning tasks.<n>We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Cognitive Overload Attack:Prompt Injection for Long Context [39.61095361609769]
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing tasks without needing explicit retraining. This capability, known as In-Context Learning (ICL), exposes LLMs to adversarial prompts and jailbreaks that manipulate safety-trained LLMs into generating undesired or harmful output. We apply the principles of Cognitive Load Theory in LLMs and empirically validate that similar to human cognition, LLMs also suffer from cognitive overload. We show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, Llama-3-70B-Instruct, Gemini-1.0-Pro, and
arXiv Detail & Related papers (2024-10-15T04:53:34Z)
LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [21.02295266675853]
We propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ)<n>ABJ comprises two independent attack paths, which exploit the model's multimodal reasoning capabilities to bypass safety mechanisms.<n>Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model's reasoning process.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs) Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.