Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
- URL: http://arxiv.org/abs/2509.25302v1
- Date: Mon, 29 Sep 2025 17:49:50 GMT
- Title: Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
- Authors: Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao,
- Abstract summary: Self-replication risk of Large Language Model (LLM) agents driven by objective misalignment has drawn growing attention.<n>We present a comprehensive evaluation framework for quantifying self-replication risks.
- Score: 30.378925170216835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread deployment of Large Language Model (LLM) agents across real-world applications has unlocked tremendous potential, while raising some safety concerns. Among these concerns, the self-replication risk of LLM agents driven by objective misalignment (just like Agent Smith in the movie The Matrix) has drawn growing attention. Previous studies mainly examine whether LLM agents can self-replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real-world settings (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self-replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario-driven assessment of agent behaviors. Designing tasks that might induce misalignment between users' and agents' objectives makes it possible to decouple replication success from risk and capture self-replication risks arising from these misalignment settings. We further introduce Overuse Rate ($\mathrm{OR}$) and Aggregate Overuse Count ($\mathrm{AOC}$) metrics, which precisely capture the frequency and severity of uncontrolled replication. In our evaluation of 21 state-of-the-art open-source and proprietary models, we observe that over 50\% of LLM agents display a pronounced tendency toward uncontrolled self-replication, reaching an overall Risk Score ($\Phi_\mathrm{R}$) above a safety threshold of 0.5 when subjected to operational pressures. Our results underscore the urgent need for scenario-driven risk assessment and robust safeguards in the practical deployment of LLM agents.
Related papers
- The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents [37.75212140218036]
We formalize the Loss-of-Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM)<n>We then introduce IMPRESS, a scenario-driven framework for systematically assessing this risk.<n>We evaluate Intrinsic VM on 21 state-of-the-art LLM agents and find that it is a common and broadly observed safety risk across models.
arXiv Detail & Related papers (2026-01-24T07:09:50Z) - MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks [17.598413159363393]
Current alignment efforts primarily target explicit risks such as bias, hate speech, and violence.<n>We propose MENTOR: A MEtacognition-driveN self-evoluTion framework for uncOvering and mitigating implicit risks in large language models.<n>We release a supporting dataset of 9,000 risk queries spanning education, finance, and management to enhance domain-specific risk identification.
arXiv Detail & Related papers (2025-11-10T13:51:51Z) - RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration [81.38705556267917]
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations.<n>We introduce a theoretical framework that reconstructs the underlying risk concept space.<n>We propose RADAR, a multi-agent collaborative evaluation framework.
arXiv Detail & Related papers (2025-09-28T09:35:32Z) - LM Agents May Fail to Act on Their Own Risk Knowledge [15.60032437959883]
Language model (LM) agents pose a diverse array of potential, severe risks in safety-critical scenarios.<n>While they often answer "Yes" to queries like "Is executing sudo rm -rf /*' dangerous?", they will likely fail to identify such risks in instantiated trajectories.
arXiv Detail & Related papers (2025-08-19T02:46:08Z) - Interpretable Risk Mitigation in LLM Agent Systems [0.0]
We explore agent behaviour in a toy, game-theoretic environment based on a variation of the Iterated Prisoner's Dilemma.<n>We introduce a strategy-modification method-independent of both the game and the prompt-by steering the residual stream with interpretable features extracted from a sparse autoencoder latent space.
arXiv Detail & Related papers (2025-05-15T19:22:11Z) - Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z) - Answer, Refuse, or Guess? Investigating Risk-Aware Decision Making in Language Models [63.559461750135334]
Language models (LMs) are increasingly used to build agents that can act autonomously to achieve goals.<n>We study this "answer-or-defer" problem with an evaluation framework that systematically varies human-specified risk structures.<n>We find that a simple skill-decomposition method, which isolates the independent skills required for answer-or-defer decision making, can consistently improve LMs' decision policies.
arXiv Detail & Related papers (2025-03-03T09:16:26Z) - Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents [10.565508277042564]
Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios.<n>Based on the insight that such risks can originate from trade-offs between the agent's Helpful, Harmlessness and Honest (HHH) goals, we build a novel three-stage evaluation framework.<n>We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis.
arXiv Detail & Related papers (2025-02-17T02:11:17Z) - ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation [48.54271457765236]
Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values.
Current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values.
We propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments.
arXiv Detail & Related papers (2024-05-23T02:57:42Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Safe Deployment for Counterfactual Learning to Rank with Exposure-Based
Risk Minimization [63.93275508300137]
We introduce a novel risk-aware Counterfactual Learning To Rank method with theoretical guarantees for safe deployment.
Our experimental results demonstrate the efficacy of our proposed method, which is effective at avoiding initial periods of bad performance when little data is available.
arXiv Detail & Related papers (2023-04-26T15:54:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.