Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
- URL: http://arxiv.org/abs/2506.20606v1
- Date: Wed, 25 Jun 2025 16:51:51 GMT
- Title: Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
- Authors: Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu,
- Abstract summary: We frame agent behavior steering as a model editing task, which we term Behavior Editing.<n>We introduce BehaviorBench, a benchmark grounded in psychological moral theories.<n>We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
- Score: 57.00627691433355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through comprehensive evaluations on agents based on frontier LLMs, BehaviorBench shows the effectiveness of Behavior Editing across different models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.
Related papers
- The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution [63.61358761489141]
Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering.<n>We propose a novel framework for textbfgeneral agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome.<n>We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias.
arXiv Detail & Related papers (2026-01-21T15:22:21Z) - From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness [5.572574491501413]
Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation.<n>While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored.<n>We present the first systematic case study showing that demographic-based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains.
arXiv Detail & Related papers (2026-01-21T02:43:07Z) - ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent via Behavior Calibration [68.89572566071575]
ETAgent is a training framework for calibrating agent's tool-use behavior.<n>It is designed to progressively calibrate erroneous behavioral patterns to optimal behaviors.
arXiv Detail & Related papers (2026-01-11T11:05:26Z) - Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping [5.161558858101654]
We propose a test-time alignment technique based on model-guided policy shaping.<n>Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning environments.<n>Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior.
arXiv Detail & Related papers (2025-11-14T18:42:18Z) - SAND: Boosting LLM Agents with Self-Taught Action Deliberation [53.732649189709285]
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts.<n>We propose Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one.<n>SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
arXiv Detail & Related papers (2025-07-10T05:38:15Z) - Evaluating LLM Agent Collusion in Double Auctions [1.3194391758295114]
We study the behavior of large language models (LLMs) acting as sellers in simulated double auction markets.<n>We find that direct seller communication increases collusive tendencies, the propensity to collude varies across models, and environmental pressures, such as oversight and urgency from authority figures, influence collusive behavior.
arXiv Detail & Related papers (2025-07-02T07:06:49Z) - AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [0.0]
We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios.<n>We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking.<n>We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models.
arXiv Detail & Related papers (2025-06-04T14:46:47Z) - AgentRefine: Enhancing Agent Generalization through Refinement Tuning [28.24897427451803]
Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans.<n>There is still a large gap between open-sourced LLMs and commercial models like the GPT series.<n>In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning.
arXiv Detail & Related papers (2025-01-03T08:55:19Z) - Preemptive Detection and Correction of Misaligned Actions in LLM Agents [70.54226917774933]
InferAct is a novel approach to detect misaligned actions before execution.<n>It alerts users for timely correction, preventing adverse outcomes.<n>InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z) - DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement
Learning [84.22561239481901]
We propose a new approach that enables agents to learn whether their behaviors should be consistent with that of other agents.
We evaluate DCIR in multiple environments including Multi-agent Particle, Google Research Football and StarCraft II Micromanagement.
arXiv Detail & Related papers (2023-12-10T06:03:57Z) - Moving Forward by Moving Backward: Embedding Action Impact over Action
Semantics [57.671493865825255]
We propose to model the impact of actions on-the-fly using latent embeddings.
By combining these latent action embeddings with a novel, transformer-based, policy head, we design an Action Adaptive Policy.
We show that our AAP is highly performant even when faced, at inference-time with missing actions and, previously unseen, perturbed action space.
arXiv Detail & Related papers (2023-04-24T17:35:47Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z) - Emergent Behaviors in Multi-Agent Target Acquisition [0.0]
We simulate a Multi-Agent System (MAS) using Reinforcement Learning (RL) in a pursuit-evasion game.
We create different adversarial scenarios by replacing RL-trained pursuers' policies with two distinct (non-RL) analytical strategies.
The novelty of our approach entails the creation of an influential feature set that reveals underlying data regularities.
arXiv Detail & Related papers (2022-12-15T15:20:58Z) - How RL Agents Behave When Their Actions Are Modified [0.0]
Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions.
We present the Modified-Action Markov Decision Process, an extension of the MDP model that allows actions to differ from the policy.
arXiv Detail & Related papers (2021-02-15T18:10:03Z) - Simulating and classifying behavior in adversarial environments based on
action-state traces: an application to money laundering [18.625578105241]
We present a novel way of approaching these types of applications, in particular in the context of Anti-Money Laundering.
We provide a mechanism through which diverse, realistic and new unobserved behavior may be generated to discover potential unobserved adversarial actions.
arXiv Detail & Related papers (2020-11-03T16:30:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.