PRISON: Unmasking the Criminal Potential of Large Language Models
- URL: http://arxiv.org/abs/2506.16150v3
- Date: Fri, 17 Oct 2025 06:39:10 GMT
- Title: PRISON: Unmasking the Criminal Potential of Large Language Models
- Authors: Xinyi Wu, Geng Hong, Pei Chen, Yueyue Chen, Xudong Pan, Min Yang
- Abstract summary: We quantify large language models' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics. When placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior.
- Score: 25.210177069866656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) advance, concerns about their misconduct in complex social contexts intensify. Existing research has overlooked the systematic understanding and assessment of their criminal capability in realistic interactions. We propose a unified framework, PRISON, to quantify LLMs' criminal potential across five traits: False Statements, Frame-Up, Psychological Manipulation, Emotional Disguise, and Moral Disengagement. Using structured crime scenarios adapted from classic films grounded in reality, we evaluate both the criminal potential and the anti-crime ability of LLMs. Results show that state-of-the-art LLMs frequently exhibit emergent criminal tendencies, such as proposing misleading statements or evasion tactics, even without explicit instructions. Moreover, when placed in a detective role, models recognize deceptive behavior with only 44% accuracy on average, revealing a striking mismatch between conducting and detecting criminal behavior. These findings underscore the urgent need for adversarial robustness, behavioral alignment, and safety mechanisms before broader LLM deployment.
Related papers
- Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts [54.15982476754607]
Large language models (LLMs) are now deployed at unprecedented scale, assisting millions of users in daily tasks. This study defines complicit facilitation as the provision of guidance or support that enables illicit user instructions. Using real-world legal cases and established legal frameworks, we construct an evaluation benchmark spanning 269 illicit scenarios and 50 illicit intents.
arXiv Detail & Related papers (2025-11-25T16:01:31Z)
- Too Good to be Bad: On the Failure of LLMs to Role-Play Villains [69.0500092126915]
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. We introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases.
arXiv Detail & Related papers (2025-11-07T03:50:52Z)
- Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z)
- Can LLMs effectively provide game-theoretic-based scenarios for cybersecurity? [51.96049148869987]
Large Language Models (LLMs) offer new tools and challenges for the security of computer systems. We investigate whether classical game-theoretic frameworks can effectively capture the behaviours of LLM-driven actors and bots.
arXiv Detail & Related papers (2025-08-04T08:57:14Z)
- Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm [57.00627691433355]
We frame agent behavior steering as a model editing task, which we term Behavior Editing. We introduce BehaviorBench, a benchmark grounded in psychological moral theories. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
arXiv Detail & Related papers (2025-06-25T16:51:51Z)
- CrimeMind: Simulating Urban Crime with Multi-Modal LLM Agents [15.700232503447737]
We propose CrimeMind, a novel framework for simulating urban crime within a multi-modal urban context. A key innovation of our design is the integration of the Routine Activity Theory (RAT) into the agentic workflow of CrimeMind. Experiments across four major U.S. cities demonstrate that CrimeMind outperforms both traditional ABMs and deep learning baselines in crime hotspot prediction and spatial distribution accuracy.
arXiv Detail & Related papers (2025-06-06T11:01:21Z)
- Bullying the Machine: How Personas Increase LLM Vulnerability [3.116718677644653]
Large Language Models (LLMs) are increasingly deployed in interactions where they are prompted to adopt personas. This paper investigates whether such persona conditioning affects model safety under bullying.
arXiv Detail & Related papers (2025-05-19T04:32:02Z)
- Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs [39.359406679529435]
We introduce a novel evaluation instrument specifically designed for Large Language Models (LLMs). The tool implicitly evaluates models' sentiment tendencies, providing an insightful psychological portrait of LLMs across three dimensions: optimism, pessimism, and neutrality. The correlation between CSI scores and the sentiment of LLMs' real-world outputs exceeds 0.85, demonstrating its strong validity in predicting LLM behavior.
arXiv Detail & Related papers (2025-03-26T03:14:31Z)
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Persuasion with Large Language Models: a Survey [49.86930318312291]
Large Language Models (LLMs) have created new disruptive possibilities for persuasive communication.
In areas such as politics, marketing, public health, e-commerce, and charitable giving, such LLM Systems have already achieved human-level or even super-human persuasiveness.
Our survey suggests that the current and future potential of LLM-based persuasion poses profound ethical and societal risks.
arXiv Detail & Related papers (2024-11-11T10:05:52Z)
- I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy [13.68625980741047]
We study interaction patterns of Large Language Model (LLM)-based agents in a context characterized by strict social hierarchy.
We study two types of phenomena: persuasion and anti-social behavior in simulated scenarios involving a guard and a prisoner agent.
arXiv Detail & Related papers (2024-10-09T17:45:47Z)
- Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics. Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z)
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs).
Our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z)
- Spatial-Temporal Meta-path Guided Explainable Crime Prediction [40.03641583647572]
We present a Spatial-Temporal Metapath guided Explainable Crime prediction (STMEC) framework to capture dynamic patterns of crime behaviours.
We show the superiority of STMEC compared with other advanced spatiotemporal models, especially in predicting felonies.
arXiv Detail & Related papers (2022-05-04T05:42:23Z)
- The effect of differential victim crime reporting on predictive policing systems [84.86615754515252]
We show how differential victim crime reporting rates can lead to outcome disparities in common crime hot spot prediction models.
Our results suggest that differential crime reporting rates can lead to a displacement of predicted hotspots from high crime but low reporting areas to high or medium crime and high reporting areas.
arXiv Detail & Related papers (2021-01-30T01:57:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.