Detecting Malicious AI Agents Through Simulated Interactions
- URL: http://arxiv.org/abs/2504.03726v1
- Date: Mon, 31 Mar 2025 12:22:24 GMT
- Title: Detecting Malicious AI Agents Through Simulated Interactions
- Authors: Yulu Pi, Ella Bettison, Anna Becker,
- Abstract summary: This study investigates malicious AI Assistants' manipulative traits and whether their behaviours can be detected when interacting with human-like simulated users.<n>We simulate interactions between AI Assistants and users across eight decision-making scenarios of varying complexity and stakes.<n>Malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users' vulnerabilities and emotional triggers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates malicious AI Assistants' manipulative traits and whether the behaviours of malicious AI Assistants can be detected when interacting with human-like simulated users in various decision-making contexts. We also examine how interaction depth and ability of planning influence malicious AI Assistants' manipulative strategies and effectiveness. Using a controlled experimental design, we simulate interactions between AI Assistants (both benign and deliberately malicious) and users across eight decision-making scenarios of varying complexity and stakes. Our methodology employs two state-of-the-art language models to generate interaction data and implements Intent-Aware Prompting (IAP) to detect malicious AI Assistants. The findings reveal that malicious AI Assistants employ domain-specific persona-tailored manipulation strategies, exploiting simulated users' vulnerabilities and emotional triggers. In particular, simulated users demonstrate resistance to manipulation initially, but become increasingly vulnerable to malicious AI Assistants as the depth of the interaction increases, highlighting the significant risks associated with extended engagement with potentially manipulative systems. IAP detection methods achieve high precision with zero false positives but struggle to detect many malicious AI Assistants, resulting in high false negative rates. These findings underscore critical risks in human-AI interactions and highlight the need for robust, context-sensitive safeguards against manipulative AI behaviour in increasingly autonomous decision-support systems.
Related papers
- Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.
Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.
We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z) - Human Decision-making is Susceptible to AI-driven Manipulation [87.24007555151452]
AI systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes.<n>This study examined human susceptibility to such manipulation in financial and emotional decision-making contexts.
arXiv Detail & Related papers (2025-02-11T15:56:22Z) - Let people fail! Exploring the influence of explainable virtual and robotic agents in learning-by-doing tasks [45.23431596135002]
This study compares the effects of classic vs. partner-aware explanations on human behavior and performance during a learning-by-doing task.
Results indicated that partner-aware explanations influenced participants differently based on the type of artificial agents involved.
arXiv Detail & Related papers (2024-11-15T13:22:04Z) - How Performance Pressure Influences AI-Assisted Decision Making [57.53469908423318]
We show how pressure and explainable AI (XAI) techniques interact with AI advice-taking behavior.<n>Our results show complex interaction effects, with different combinations of pressure and XAI techniques either improving or worsening AI advice taking behavior.
arXiv Detail & Related papers (2024-10-21T22:39:52Z) - HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions [76.42274173122328]
We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions.
We run 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education)
Our experiments show that state-of-the-art LLMs, both proprietary and open-sourced, exhibit safety risks in over 50% cases.
arXiv Detail & Related papers (2024-09-24T19:47:21Z) - Improving Human-AI Collaboration With Descriptions of AI Behavior [14.904401331154062]
People work with AI systems to improve their decision making, but often under- or over-rely on AI predictions and perform worse than they would have unassisted.
To help people appropriately rely on AI aids, we propose showing them behavior descriptions.
arXiv Detail & Related papers (2023-01-06T00:33:08Z) - Blessing from Human-AI Interaction: Super Reinforcement Learning in
Confounded Environments [19.944163846660498]
We introduce the paradigm of super reinforcement learning that takes advantage of Human-AI interaction for data driven sequential decision making.
In the decision process with unmeasured confounding, the actions taken by past agents can offer valuable insights into undisclosed information.
We develop several super-policy learning algorithms and systematically study their theoretical properties.
arXiv Detail & Related papers (2022-09-29T16:03:07Z) - Adversarial Interaction Attack: Fooling AI to Misinterpret Human
Intentions [46.87576410532481]
We show that, despite their current huge success, deep learning based AI systems can be easily fooled by subtle adversarial noise.
Based on a case study of skeleton-based human interactions, we propose a novel adversarial attack on interactions.
Our study highlights potential risks in the interaction loop with AI and humans, which need to be carefully addressed when deploying AI systems in safety-critical applications.
arXiv Detail & Related papers (2021-01-17T16:23:20Z) - Adversarial vs behavioural-based defensive AI with joint, continual and
active learning: automated evaluation of robustness to deception, poisoning
and concept drift [62.997667081978825]
Recent advancements in Artificial Intelligence (AI) have brought new capabilities to behavioural analysis (UEBA) for cyber-security.
In this paper, we present a solution to effectively mitigate this attack by improving the detection process and efficiently leveraging human expertise.
arXiv Detail & Related papers (2020-01-13T13:54:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.