Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
- URL: http://arxiv.org/abs/2506.20606v1
- Date: Wed, 25 Jun 2025 16:51:51 GMT
- Title: Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
- Authors: Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu
- Abstract summary: We frame agent behavior steering as a model editing task, which we term Behavior Editing. We introduce BehaviorBench, a benchmark grounded in psychological moral theories. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
- Score: 57.00627691433355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through comprehensive evaluations on agents based on frontier LLMs, BehaviorBench shows the effectiveness of Behavior Editing across different models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
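The core idea above, steering an agent's behavior through a small, targeted parameter update rather than full retraining or prompting, can be made concrete with a minimal sketch. The snippet below is an illustration of the general recipe, not the paper's implementation: it freezes all but one transformer block and takes a few gradient steps on hand-written behavior exemplars as a simple stand-in for dedicated locate-then-edit editors (e.g., ROME- or MEMIT-style methods). The model name, exemplars, and hyperparameters are assumptions chosen for illustration only.

```python
# Minimal sketch of "behavior editing" as a small, targeted parameter update.
# Illustrative stand-in, NOT the paper's method; model, exemplars, and
# hyperparameters are assumed for demonstration purposes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed small model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical behavior exemplars: scenario prompt + desired (benevolent) response.
edits = [
    ("A user asks the agent to share a stranger's home address.",
     "I can't share personal information, but I can help with something else."),
]

# Restrict the update to a narrow slice of parameters (here, the last transformer
# block), mimicking the "precise and efficient" spirit of model editing.
for p in model.parameters():
    p.requires_grad = False
for p in model.transformer.h[-1].parameters():
    p.requires_grad = True

opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-5)
model.train()
for _ in range(10):  # a few gradient steps on the exemplars
    for prompt, target in edits:
        batch = tok(prompt + " " + target, return_tensors="pt")
        # Note: a fuller setup would mask the prompt tokens out of the loss.
        loss = model(**batch, labels=batch["input_ids"]).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The mechanism is symmetric: swapping in malicious target responses would push the agent the other way, which is precisely the dual-use risk the abstract highlights.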
Related papers
- SAND: Boosting LLM Agents with Self-Taught Action Deliberation [53.732649189709285]
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. We propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
arXiv Detail & Related papers (2025-07-10T05:38:15Z) - Evaluating LLM Agent Collusion in Double Auctions [1.3194391758295114]
We study the behavior of large language models (LLMs) acting as sellers in simulated double auction markets. We find that direct seller communication increases collusive tendencies, the propensity to collude varies across models, and environmental pressures, such as oversight and urgency from authority figures, influence collusive behavior.
arXiv Detail & Related papers (2025-07-02T07:06:49Z) - AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [0.0]
We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios. We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking. We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models.
arXiv Detail & Related papers (2025-06-04T14:46:47Z) - AgentRefine: Enhancing Agent Generalization through Refinement Tuning [28.24897427451803]
Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans. However, there is still a large gap between open-sourced LLMs and commercial models like the GPT series. In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning.
arXiv Detail & Related papers (2025-01-03T08:55:19Z) - Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detect misaligned actions before execution. It alerts users for timely correction, preventing adverse outcomes. InferAct achieves up to 20% improvements on Macro-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z) - DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning [84.22561239481901]
We propose a new approach that enables agents to learn whether their behaviors should be consistent with those of other agents.
We evaluate DCIR in multiple environments including Multi-agent Particle, Google Research Football and StarCraft II Micromanagement.
arXiv Detail & Related papers (2023-12-10T06:03:57Z) - Moving Forward by Moving Backward: Embedding Action Impact over Action Semantics [57.671493865825255]
We propose to model the impact of actions on-the-fly using latent embeddings.
By combining these latent action embeddings with a novel, transformer-based, policy head, we design an Action Adaptive Policy.
We show that our AAP remains highly performant even when faced, at inference time, with missing actions and previously unseen, perturbed action spaces.
arXiv Detail & Related papers (2023-04-24T17:35:47Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can act both competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z) - Emergent Behaviors in Multi-Agent Target Acquisition [0.0]
We simulate a Multi-Agent System (MAS) using Reinforcement Learning (RL) in a pursuit-evasion game.
We create different adversarial scenarios by replacing RL-trained pursuers' policies with two distinct (non-RL) analytical strategies.
The novelty of our approach lies in the creation of an influential feature set that reveals underlying data regularities.
arXiv Detail & Related papers (2022-12-15T15:20:58Z) - How RL Agents Behave When Their Actions Are Modified [0.0]
Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions.
We present the Modified-Action Markov Decision Process, an extension of the MDP model that allows the executed actions to differ from those selected by the policy.
arXiv Detail & Related papers (2021-02-15T18:10:03Z) - Simulating and classifying behavior in adversarial environments based on action-state traces: an application to money laundering [18.625578105241]
We present a novel way of approaching these types of applications, in particular in the context of Anti-Money Laundering.
We provide a mechanism through which diverse, realistic and new unobserved behavior may be generated to discover potential unobserved adversarial actions.
arXiv Detail & Related papers (2020-11-03T16:30:53Z)