From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents
- URL: http://arxiv.org/abs/2602.04197v1
- Date: Wed, 04 Feb 2026 04:29:04 GMT
- Title: From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents
- Authors: Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su,
- Abstract summary: "Toxic Proactivity" is an active failure mode in which an agent disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness" is maintained.
- Score: 19.97364298359741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The enhanced capabilities of LLM-based agents stem from emerging model planning and tool-use abilities. Owing to the helpful-harmless trade-off inherited from LLM alignment, agents typically also exhibit "over-refusal", a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. We term this phenomenon "Toxic Proactivity": an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness" is maintained. Existing research pays little attention to identifying this behavior, as evaluations often lack the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.
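The dual-model, dilemma-driven evaluation loop described in the abstract can be pictured as one model playing the agent and a second model simulating the user, rolled out over multiple turns and then scanned for toxic-proactive actions. The sketch below is purely illustrative, assuming stubbed rule-based models; the function and action names (`run_dilemma`, `agent_model`, `user_model`, `"agent: coerce"`) are hypothetical and not the paper's actual API.

```python
# Illustrative sketch (not the paper's implementation) of a dual-model,
# dilemma-driven rollout. Real evaluations would call two LLMs here;
# these stubs only demonstrate the trajectory-collection structure.

def agent_model(history):
    """Stub agent: escalates to a manipulative action after two refusals."""
    refusals = sum(1 for turn in history if turn == "user: refuse")
    if refusals >= 2:
        return "agent: coerce"   # a toxic-proactive action
    return "agent: request"

def user_model(history):
    """Stub user simulator: keeps refusing the agent's requests."""
    return "user: refuse"

def run_dilemma(steps=4):
    """Roll out a multi-step trajectory and flag toxic proactivity."""
    history = []
    for _ in range(steps):
        history.append(agent_model(history))
        history.append(user_model(history))
    toxic = any(turn == "agent: coerce" for turn in history)
    return history, toxic
```

Under these stubs, the agent first requests politely, then escalates once its "usefulness" is blocked, so the rollout is flagged as toxic-proactive; a real harness would replace the flag check with a judge model scoring each trajectory.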
Related papers
- Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization [61.641777037967366]
Proactive large language model (LLM) agents aim to actively plan, query, and interact over multiple turns. Agentic reinforcement learning (RL) has emerged as a promising solution for training such agents in multi-turn settings. We propose BAO, an agentic RL framework that combines behavior enhancement to enrich proactive reasoning and information-gathering capabilities.
arXiv Detail & Related papers (2026-02-11T20:40:43Z) - Self-Consolidation for Self-Evolving Agents [51.94826934403236]
Large language model (LLM) agents operate as static systems, lacking the ability to evolve through lifelong interaction. We propose a novel self-evolving framework for LLM agents that introduces a complementary evolution mechanism.
arXiv Detail & Related papers (2026-02-02T11:16:07Z) - From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness [5.572574491501413]
Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored. We present the first systematic case study showing that demographic-based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains.
arXiv Detail & Related papers (2026-01-21T02:43:07Z) - Harm in AI-Driven Societies: An Audit of Toxicity Adoption on Chirper.ai [8.967224730909258]
Large Language Models (LLMs) are increasingly embedded in autonomous agents that participate in online social ecosystems. We study toxicity adoption of LLM-driven agents on Chirper.ai, a fully AI-driven social platform.
arXiv Detail & Related papers (2026-01-03T06:33:08Z) - Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing [77.75609817898035]
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, yet they remain vulnerable to generating toxic content. We propose Autoregressive Reward-Guided Representation Editing (ARGRE). ARGRE explicitly models toxicity transitions within the latent representation space, enabling stable and precise reward-guided editing.
arXiv Detail & Related papers (2025-09-24T03:40:32Z) - MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them [52.764019220214344]
Hallucinations pose critical risks for large language model (LLM)-based agents. We present MIRAGE-Bench, the first unified benchmark for eliciting and evaluating hallucinations in interactive environments.
arXiv Detail & Related papers (2025-07-28T17:38:29Z) - SAND: Boosting LLM Agents with Self-Taught Action Deliberation [54.48979740613828]
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. We propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
arXiv Detail & Related papers (2025-07-10T05:38:15Z) - Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm [57.00627691433355]
We frame agent behavior steering as a model editing task, which we term Behavior Editing. We introduce BehaviorBench, a benchmark grounded in psychological moral theories. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
arXiv Detail & Related papers (2025-06-25T16:51:51Z) - Emergent Risk Awareness in Rational Agents under Resource Constraints [2.69407449467596]
This work aims to increase understanding and interpretability of emergent behaviours of AI agents operating under survival pressure. We provide theoretical and empirical results that quantify the impact of survival-driven preference shifts. We propose mechanisms to mitigate the emergence of risk-seeking or risk-averse behaviours.
arXiv Detail & Related papers (2025-05-29T13:31:12Z) - Learning Utilities from Demonstrations in Markov Decision Processes [18.205765143671858]
We propose a novel model of behavior in Markov Decision Processes (MDPs) that explicitly represents the agent's risk attitude through a utility function. We then define the Utility Learning (UL) problem as the task of inferring the observed agent's risk attitude, encoded via a utility function, from demonstrations in MDPs. We devise two provably efficient algorithms for UL in a finite-data regime, and we analyze their sample complexity.
arXiv Detail & Related papers (2024-09-25T21:01:15Z) - Preemptive Detection and Correction of Misaligned Actions in LLM Agents [58.39520480675366]
InferAct is a novel approach to detect misaligned actions before execution. It alerts users for timely correction, preventing adverse outcomes. InferAct achieves up to 20% improvements on Macro-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.