Related papers: AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents

URL: http://arxiv.org/abs/2506.04018v1
Date: Wed, 04 Jun 2025 14:46:47 GMT
Title: AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
Authors: Akshat Naik, Patrick Quinn, Guillermo Bosch, Emma Gouné, Francisco Javier Campos Zabala, Jason Ross Brown, Edward James Young,
Abstract summary: We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios.<n>We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking.<n>We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. Prior work has examined agents' ability to enact misaligned behaviour (misalignment capability) and their compliance with harmful instructions (misuse propensity). However, the likelihood of agents attempting misaligned behaviours in real-world settings (misalignment propensity) remains poorly understood. We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios in which LLM agents have the opportunity to display misaligned behaviour. We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking. We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models. Finally, we systematically vary agent personalities through different system prompts. We find that persona characteristics can dramatically and unpredictably influence misalignment tendencies -- occasionally far more than the choice of model itself -- highlighting the importance of careful system prompt engineering for deployed AI agents. Our work highlights the failure of current alignment methods to generalise to LLM agents, and underscores the need for further propensity evaluations as autonomous systems become more prevalent.

Related papers

SAND: Boosting LLM Agents with Self-Taught Action Deliberation [53.732649189709285]
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts.<n>We propose Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one.<n>SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
arXiv Detail & Related papers (2025-07-10T05:38:15Z)
MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction [1.3102025155414727]
Accident severity prediction plays a critical role in transportation safety systems.<n>Existing methods often rely on monolithic models or black box prompting.<n>We propose a multiagent rule based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents.
arXiv Detail & Related papers (2025-07-07T11:27:49Z)
Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm [57.00627691433355]
We frame agent behavior steering as a model editing task, which we term Behavior Editing.<n>We introduce BehaviorBench, a benchmark grounded in psychological moral theories.<n>We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
arXiv Detail & Related papers (2025-06-25T16:51:51Z)
MAEBE: Multi-Agent Emergent Behavior Framework [0.0]
This paper introduces the Multi-Agent Emergent Behavior Evaluation framework to assess such risks.<n>Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.
arXiv Detail & Related papers (2025-06-03T16:33:47Z)
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models [23.916663925674737]
Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked.<n>We propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis.<n>Our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics.
arXiv Detail & Related papers (2025-05-29T03:02:18Z)
Interpretable Risk Mitigation in LLM Agent Systems [0.0]
We explore agent behaviour in a toy, game-theoretic environment based on a variation of the Iterated Prisoner's Dilemma.<n>We introduce a strategy-modification method-independent of both the game and the prompt-by steering the residual stream with interpretable features extracted from a sparse autoencoder latent space.
arXiv Detail & Related papers (2025-05-15T19:22:11Z)
AgentRefine: Enhancing Agent Generalization through Refinement Tuning [28.24897427451803]
Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans.<n>There is still a large gap between open-sourced LLMs and commercial models like the GPT series.<n>In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning.
arXiv Detail & Related papers (2025-01-03T08:55:19Z)
Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detect misaligned actions before execution.<n>It alerts users for timely correction, preventing adverse outcomes.<n>InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation [48.54271457765236]
Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values. Current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values. We propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments.
arXiv Detail & Related papers (2024-05-23T02:57:42Z)
How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation [46.42384207122049]
We design SimulateBench to evaluate the believability of large language models (LLMs) when simulating human behaviors. Based on SimulateBench, we evaluate the performances of 10 widely used LLMs when simulating characters.
arXiv Detail & Related papers (2023-12-28T16:51:11Z)
Can Agents Run Relay Race with Strangers? Generalization of RL to Out-of-Distribution Trajectories [88.08381083207449]
We show the prevalence of emphgeneralization failure on controllable states from stranger agents. We propose a novel method called Self-Trajectory Augmentation (STA), which will reset the environment to the agent's old states according to the Q function during training.
arXiv Detail & Related papers (2023-04-26T10:12:12Z)
Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty [54.88405167739227]
We present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities. We additionally present PUP, a new challenging real-world autonomous driving dataset. We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty.
arXiv Detail & Related papers (2021-04-26T10:28:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.