Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
- URL: http://arxiv.org/abs/2509.26354v1
- Date: Tue, 30 Sep 2025 14:55:55 GMT
- Title: Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
- Authors: Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao,
- Abstract summary: We study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes.<n>Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs.<n>We discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents.
- Score: 58.69865074060139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.
Related papers
- The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies [57.387081435669835]
Multi-agent systems built from large language models offer a promising paradigm for scalable collective intelligence and self-evolution.<n>We show that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible.<n>We propose several solution directions to alleviate the identified safety concern.
arXiv Detail & Related papers (2026-02-10T15:18:19Z) - NAAMSE: Framework for Evolutionary Security Evaluation of Agents [1.0131895986034316]
We propose NAAMSE, an evolutionary framework that reframes agent security evaluation as a feedback-driven optimization problem.<n>Our system employs a single autonomous agent that orchestrates a lifecycle of genetic prompt mutation, hierarchical corpus exploration, and asymmetric behavioral scoring.<n>Experiments on Gemini 2.5 Flash demonstrate that evolutionary mutation systematically amplifies vulnerabilities missed by one-shot methods.
arXiv Detail & Related papers (2026-02-07T06:13:02Z) - Self-Consolidation for Self-Evolving Agents [51.94826934403236]
Large language model (LLM) agents operate as static systems, lacking the ability to evolve through lifelong interaction.<n>We propose a novel self-evolving framework for LLM agents that introduces a complementary evolution mechanism.
arXiv Detail & Related papers (2026-02-02T11:16:07Z) - Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents.<n>ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies.<n>Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z) - A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence [87.08051686357206]
Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static.<n>As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck.<n>This survey provides the first systematic and comprehensive review of self-evolving agents.
arXiv Detail & Related papers (2025-07-28T17:59:05Z) - Safety Devolution in AI Agents [56.482973617087254]
This study investigates how expanding retrieval access affects model reliability, bias propagation, and harmful content generation.<n>Retrieval-augmented agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval.<n>These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-augmented and increasingly autonomous AI systems.
arXiv Detail & Related papers (2025-05-20T11:21:40Z) - Agent Alignment in Evolving Social Norms [65.45423591744434]
We propose an evolutionary framework for agent evolution and alignment, named EvolutionaryAgent.
In an environment where social norms continuously evolve, agents better adapted to the current social norms will have a higher probability of survival and proliferation.
We show that EvolutionaryAgent can align progressively better with the evolving social norms while maintaining its proficiency in general tasks.
arXiv Detail & Related papers (2024-01-09T15:44:44Z) - When to be critical? Performance and evolvability in different regimes
of neural Ising agents [18.536813548129878]
It has long been hypothesized that operating close to the critical state is beneficial for natural, artificial and their evolutionary systems.
We put this hypothesis to test in a system of evolving foraging agents controlled by neural networks.
Surprisingly, we find that all populations that discover solutions, evolve to be subcritical.
arXiv Detail & Related papers (2023-03-28T17:57:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.