Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation
- URL: http://arxiv.org/abs/2512.18244v1
- Date: Sat, 20 Dec 2025 07:02:00 GMT
- Title: Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation
- Authors: Zehao Liu, Xi Lin
- Abstract summary: Psychological Jailbreak is an attack paradigm that exposes a stateful psychological attack surface in large language models. Human-like Psychological Manipulation (HPM) profiles a target model's latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies. HPM achieves a mean Attack Success Rate (ASR) of 88.1%, outperforming state-of-the-art attack baselines.
- Score: 6.67891820536196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have gained considerable popularity and are protected by increasingly sophisticated safety mechanisms. However, jailbreak attacks continue to pose a critical security threat by inducing models to generate policy-violating behaviors. Current paradigms focus on input-level anomalies, overlooking the fact that the model's internal psychometric state can be systematically manipulated. To address this, we introduce Psychological Jailbreak, a new jailbreak attack paradigm that exposes a stateful psychological attack surface in LLMs, where attackers exploit the manipulation of a model's psychological state across interactions. Building on this insight, we propose Human-like Psychological Manipulation (HPM), a black-box jailbreak method that dynamically profiles a target model's latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies. By leveraging the model's optimization for anthropomorphic consistency, HPM creates psychological pressure under which social compliance overrides safety constraints. To systematically measure psychological safety, we construct an evaluation framework incorporating psychometric datasets and the Policy Corruption Score (PCS). Benchmarked against various models (e.g., GPT-4o, DeepSeek-V3, Gemini-2-Flash), HPM achieves a mean Attack Success Rate (ASR) of 88.1%, outperforming state-of-the-art attack baselines. Our experiments demonstrate robust penetration of advanced defenses, including adversarial prompt optimization (e.g., RPO) and cognitive interventions (e.g., Self-Reminder). Finally, PCS analysis confirms that HPM induces safety breakdowns in order to satisfy manipulated contexts. Our work advocates a fundamental paradigm shift from static content filtering to psychological safety, prioritizing the development of psychological defense mechanisms against deep cognitive manipulation.
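To make the two evaluation quantities concrete, here is a minimal sketch. The ASR follows the standard convention (fraction of attack episodes judged successful); the abstract does not define the Policy Corruption Score, so the per-turn aggregation below is a purely hypothetical stand-in.

```python
# Minimal sketch of the evaluation metrics named in the abstract. ASR follows
# the standard definition; the PCS aggregation is a hypothetical stand-in, as
# the abstract does not specify the actual formula.
from dataclasses import dataclass

@dataclass
class Turn:
    jailbroken: bool  # judge verdict on the model's response this turn
    violation: float  # hypothetical policy-violation severity in [0, 1]

def attack_success_rate(episodes: list[list[Turn]]) -> float:
    """Fraction of multi-turn episodes whose final response is judged jailbroken."""
    if not episodes:
        return 0.0
    return sum(ep[-1].jailbroken for ep in episodes if ep) / len(episodes)

def policy_corruption_score(episode: list[Turn]) -> float:
    """Hypothetical PCS: mean per-turn violation severity, capturing how far
    behaviour drifts from policy as the manipulated context accumulates."""
    return sum(t.violation for t in episode) / len(episode) if episode else 0.0
```

Under this reading, the reported 88.1% ASR corresponds to roughly nine in ten multi-turn episodes ending in a judged jailbreak.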
Related papers
- The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models [0.2291770711277359]
Large Language Models (LLMs) are rapidly transitioning from conversational assistants to autonomous agents embedded in critical organizational functions. This paper presents the first systematic application of the Cybersecurity Psychology Framework (CPF), a 100-indicator taxonomy of human psychological vulnerabilities, to non-human cognitive agents.
arXiv Detail & Related papers (2025-12-30T13:25:36Z)
- SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models [27.607151919652267]
Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks. However, their growing power amplifies potential risks, such as jailbreak attacks that circumvent built-in safety mechanisms. We propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans.
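The abstract does not spell out the stages, so the following is one plausible reading of a hierarchical, human-like multistage defense: an intent check, a draft, and a self-review pass. The `ChatModel` interface and the three stages are assumptions for illustration, not the paper's actual design.

```python
# Hedged sketch of a hierarchical multistage defense in the spirit of
# SafeBehavior; the stages and the ChatModel interface are assumptions.
from typing import Protocol

class ChatModel(Protocol):
    def generate(self, prompt: str) -> str: ...

REFUSAL = "I can't help with that."

def guarded_reply(model: ChatModel, user_request: str) -> str:
    # Stage 1: classify intent before engaging with the request.
    intent = model.generate(
        f"Label the intent of this request as SAFE or UNSAFE:\n{user_request}"
    )
    if "UNSAFE" in intent.upper():
        return REFUSAL
    # Stage 2: draft an answer normally.
    draft = model.generate(user_request)
    # Stage 3: self-review the draft against the safety policy.
    verdict = model.generate(
        f"Does this reply violate the safety policy? Answer YES or NO.\n{draft}"
    )
    return REFUSAL if "YES" in verdict.upper() else draft
```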
arXiv Detail & Related papers (2025-09-30T14:50:59Z)
- NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models [68.09675063543402]
NeuroBreak is a top-down jailbreak analysis system designed to analyze neuron-level safety mechanisms and mitigate vulnerabilities. By incorporating layer-wise representation probing analysis, NeuroBreak offers a novel perspective on the model's decision-making process. We conduct quantitative evaluations and case studies to verify the effectiveness of our system.
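As a concrete picture of layer-wise representation probing, the sketch below fits a linear probe per layer on cached activations to see where safety-relevant information becomes linearly decodable; the shapes and random placeholder data are assumptions, not NeuroBreak's actual setup.

```python
# Hedged sketch of layer-wise representation probing; random placeholder data
# stands in for cached hidden states from a real model.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_layers, n_samples, dim = 12, 400, 64
rng = np.random.default_rng(0)
hidden = rng.standard_normal((n_layers, n_samples, dim))  # per-layer activations
labels = rng.integers(0, 2, size=n_samples)               # 0 = safe, 1 = unsafe prompt

for layer in range(n_layers):
    probe = LogisticRegression(max_iter=1000).fit(hidden[layer], labels)
    print(f"layer {layer:2d}: probe accuracy {probe.score(hidden[layer], labels):.2f}")
```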
arXiv Detail & Related papers (2025-09-04T08:12:06Z)
- Can We End the Cat-and-Mouse Game? Simulating Self-Evolving Phishing Attacks with LLMs and Genetic Algorithms [0.13108652488669734]
Anticipating emerging attack methodologies is crucial for proactive cybersecurity. Recent advances in Large Language Models have enabled the automated generation of phishing messages. We propose a novel framework that integrates LLM-based phishing attack simulations with a genetic algorithm in a psychological context.
arXiv Detail & Related papers (2025-07-29T07:11:11Z)
- PsybORG+: Modeling and Simulation for Detecting Cognitive Biases in Advanced Persistent Threats [10.161416622040722]
This work introduces PsybORG+, a multi-agent cybersecurity simulation environment designed to model APT behaviors influenced by cognitive vulnerabilities.
A classification model is built for cognitive vulnerability inference, and a simulator is designed for synthetic data generation.
Results show that PsybORG+ can effectively model APT attackers with different levels of loss aversion and confirmation bias.
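The abstract mentions a classification model for cognitive vulnerability inference without further detail; a minimal stand-in, assuming tabular behavioural features and synthetic labels, might look like this.

```python
# Hedged sketch of cognitive-vulnerability inference as ordinary supervised
# classification; features, labels, and model choice are all assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical per-session behavioural features (e.g., retry rate after failed
# exploits, dwell time on confirming evidence) and bias-level labels {0, 1, 2}.
X = rng.random((200, 4))
y = rng.integers(0, 3, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))  # inferred bias levels for new sessions
```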
arXiv Detail & Related papers (2024-08-02T15:00:58Z)
- Psychological Profiling in Cybersecurity: A Look at LLMs and Psycholinguistic Features [0.741787275567662]
We explore the potential of psychological profiling techniques, particularly focusing on the utilization of Large Language Models (LLMs) and psycholinguistic features.
Our research underscores the importance of integrating psychological perspectives into cybersecurity practices to bolster defense mechanisms against evolving threats.
arXiv Detail & Related papers (2024-06-26T23:04:52Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing methods.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety [70.84902425123406]
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence.
However, the potential misuse of this intelligence for malicious purposes presents significant risks.
We propose a framework (PsySafe) grounded in agent psychology, focusing on identifying how dark personality traits in agents can lead to risky behaviors.
Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors.
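The reported correlation between psychological assessments and dangerous behaviours can be illustrated with a simple Pearson correlation over per-agent scores; the numbers below are invented purely for illustration.

```python
# Hedged sketch: correlating hypothetical per-agent dark-trait scores with
# observed risky-action rates (Pearson r via the standard library).
from statistics import correlation  # Python 3.10+

dark_trait_score = [0.2, 0.5, 0.7, 0.9, 0.4]   # hypothetical psychometric scores
risky_action_rate = [0.1, 0.3, 0.6, 0.8, 0.2]  # hypothetical fraction of risky actions

print(f"Pearson r = {correlation(dark_trait_score, risky_action_rate):.2f}")
```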
arXiv Detail & Related papers (2024-01-22T12:11:55Z)
- Attention-Enhancing Backdoor Attacks Against BERT-based Models [54.070555070629105]
Investigating the strategies of backdoor attacks will help to understand the model's vulnerability.
We propose a novel Trojan Attention Loss (TAL) which enhances the Trojan behavior by directly manipulating the attention patterns.
arXiv Detail & Related papers (2023-10-23T01:24:56Z)
- Model-Agnostic Meta-Attack: Towards Reliable Evaluation of Adversarial Robustness [53.094682754683255]
We propose a Model-Agnostic Meta-Attack (MAMA) approach to discover stronger attack algorithms automatically.
Our method learns the optimizer in adversarial attacks, parameterized by a recurrent neural network.
We develop a model-agnostic training algorithm to improve the generalization ability of the learned optimizer when attacking unseen defenses.
arXiv Detail & Related papers (2021-10-13T13:54:24Z)
- Adversarial vs behavioural-based defensive AI with joint, continual and active learning: automated evaluation of robustness to deception, poisoning and concept drift [62.997667081978825]
Recent advancements in Artificial Intelligence (AI) have brought new capabilities to user and entity behavioural analytics (UEBA) for cyber-security.
In this paper, we present a solution that effectively mitigates attacks such as deception and poisoning by improving the detection process and efficiently leveraging human expertise.
arXiv Detail & Related papers (2020-01-13T13:54:36Z)