Related papers: The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models

The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models

URL: http://arxiv.org/abs/2601.00867v1
Date: Tue, 30 Dec 2025 13:25:36 GMT
Title: The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models
Authors: Giuseppe Canale, Kashyap Thimmaraju,
Abstract summary: Large Language Models (LLMs) are rapidly transitioning from conversational assistants to autonomous agents embedded in critical organizational functions.<n>This paper presents the first systematic application of the Cybersecurity Psychology Framework (cpf), a 100-indicator taxonomy of human psychological vulnerabilities, to non-human cognitive agents.
Score: 0.2291770711277359
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) are rapidly transitioning from conversational assistants to autonomous agents embedded in critical organizational functions, including Security Operations Centers (SOCs), financial systems, and infrastructure management. Current adversarial testing paradigms focus predominantly on technical attack vectors: prompt injection, jailbreaking, and data exfiltration. We argue this focus is catastrophically incomplete. LLMs, trained on vast corpora of human-generated text, have inherited not merely human knowledge but human \textit{psychological architecture} -- including the pre-cognitive vulnerabilities that render humans susceptible to social engineering, authority manipulation, and affective exploitation. This paper presents the first systematic application of the Cybersecurity Psychology Framework (\cpf{}), a 100-indicator taxonomy of human psychological vulnerabilities, to non-human cognitive agents. We introduce the \textbf{Synthetic Psychometric Assessment Protocol} (\sysname{}), a methodology for converting \cpf{} indicators into adversarial scenarios targeting LLM decision-making. Our preliminary hypothesis testing across seven major LLM families reveals a disturbing pattern: while models demonstrate robust defenses against traditional jailbreaks, they exhibit critical susceptibility to authority-gradient manipulation, temporal pressure exploitation, and convergent-state attacks that mirror human cognitive failure modes. We term this phenomenon \textbf{Anthropomorphic Vulnerability Inheritance} (AVI) and propose that the security community must urgently develop ``psychological firewalls'' -- intervention mechanisms adapted from the Cybersecurity Psychology Intervention Framework (\cpif{}) -- to protect AI agents operating in adversarial environments.

Related papers

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs [65.6660735371212]
We present textbftextscJustAsk, a framework that autonomously discovers effective extraction strategies through interaction alone.<n>It formulates extraction as an online exploration problem, using Upper Confidence Bound--based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration.<n>Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.
arXiv Detail & Related papers (2026-01-29T03:53:25Z)
Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation [6.67891820536196]
Psychological Jailbreak is an attack paradigm that exposes a stateful psychological attack surface in large language models.<n>Human-like Psychological Manipulation (HPM) profiles a target model's latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies.<n>HPM achieves a mean Attack Success Rate (ASR) of 88.1%, outperforming state-of-the-art attack baselines.
arXiv Detail & Related papers (2025-12-20T07:02:00Z)
SoK: Trust-Authorization Mismatch in LLM Agent Interactions [16.633676842555044]
Large Language Models (LLMs) are rapidly evolving into autonomous agents capable of interacting with the external world.<n>This paper provides a unifying formal lens for agent-interaction security.<n>We introduce a novel risk analysis model centered on the trust-authorization gap.
arXiv Detail & Related papers (2025-12-07T16:41:02Z)
AI Deception: Risks, Dynamics, and Controls [153.71048309527225]
This project provides a comprehensive and up-to-date overview of the AI deception field.<n>We identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception.<n>We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment.
arXiv Detail & Related papers (2025-11-27T16:56:04Z)
The Cybersecurity of a Humanoid Robot [0.5958112901546286]
This report presents a comprehensive security assessment of a production humanoid robot platform.<n>We uncovered a complex security landscape characterized by both sophisticated defensive mechanisms and critical vulnerabilities.<n>This work contributes empirical evidence for developing robust security standards as humanoid robots transition from research curiosities to operational systems in critical domains.
arXiv Detail & Related papers (2025-09-17T15:37:09Z)
NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models [68.09675063543402]
NeuroBreak is a top-down jailbreak analysis system designed to analyze neuron-level safety mechanisms and mitigate vulnerabilities.<n>By incorporating layer-wise representation probing analysis, NeuroBreak offers a novel perspective on the model's decision-making process.<n>We conduct quantitative evaluations and case studies to verify the effectiveness of our system.
arXiv Detail & Related papers (2025-09-04T08:12:06Z)
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z)
QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agentic AI [2.505520948667288]
We introduce Cognitive Degradation as a novel vulnerability class in agentic AI systems.<n>These failures originate internally, arising from memory starvation, planner recursion, context flooding, and output suppression.<n>To address this class of failures, we introduce the Qorvex Security AI Framework for Behavioral & Cognitive Resilience.
arXiv Detail & Related papers (2025-07-21T07:41:58Z)
So, I climbed to the top of the pyramid of pain -- now what? [1.3249509346606658]
Humal Layer Kill Chain integrates human psychology and behaviour into the analysis of cyber threats.<n>By merging the Human Layer with the Cyber Kill Chain, we propose a Sociotechnical Kill Plane.<n>This framework not only aids cybersecurity professionals in understanding adversarial methods, but also empowers non-technical personnel to engage in threat identification and response.
arXiv Detail & Related papers (2025-05-30T15:09:03Z)
PsybORG+: Modeling and Simulation for Detecting Cognitive Biases in Advanced Persistent Threats [10.161416622040722]
This work introduces PsybORG$+$, a multi-agent cybersecurity simulation environment designed to model APT behaviors influenced by cognitive vulnerabilities. A classification model is built for cognitive vulnerability inference and a simulator is designed for synthetic data generation. Results show that PsybORG$+$ can effectively model APT attackers with different loss aversion and confirmation bias levels.
arXiv Detail & Related papers (2024-08-02T15:00:58Z)
Unveiling Vulnerability of Self-Attention [61.85150061213987]
Pre-trained language models (PLMs) are shown to be vulnerable to minor word changes. This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism. We introduce textitS-Attend, a novel smoothing technique that effectively makes SA robust via structural perturbations.
arXiv Detail & Related papers (2024-02-26T10:31:45Z)
Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
Adversarial vs behavioural-based defensive AI with joint, continual and active learning: automated evaluation of robustness to deception, poisoning and concept drift [62.997667081978825]
Recent advancements in Artificial Intelligence (AI) have brought new capabilities to behavioural analysis (UEBA) for cyber-security. In this paper, we present a solution to effectively mitigate this attack by improving the detection process and efficiently leveraging human expertise.
arXiv Detail & Related papers (2020-01-13T13:54:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.