Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions
- URL: http://arxiv.org/abs/2512.02682v1
- Date: Tue, 02 Dec 2025 12:06:57 GMT
- Title: Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions
- Authors: Piercosma Bisconti, Marcello Galisai, Federico Pierucci, Marcantonio Bracale, Matteo Prandi,
- Abstract summary: This paper examines why safety mechanisms designed for human-model interaction do not scale to environments where large language models interact with each other. We propose a conceptual transition from model-level safety to system-level safety, introducing the framework of the Emergent Systemic Risk Horizon (ESRH). The paper contributes (i) a theoretical account of collective risk in interacting LLMs, (ii) a taxonomy connecting micro, meso, and macro-level failure modes, and (iii) a design proposal for InstitutionalAI, an architecture for embedding adaptive oversight within multi-agent systems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper examines why safety mechanisms designed for human-model interaction do not scale to environments where large language models (LLMs) interact with each other. Most current governance practices still rely on single-agent safety: containment, prompts, fine-tuning, and moderation layers that constrain individual model behavior but leave the dynamics of multi-model interaction ungoverned. These mechanisms assume a dyadic setting: one model responding to one user under stable oversight. Yet research and industrial development are rapidly shifting toward LLM-to-LLM ecosystems, where outputs are recursively reused as inputs across chains of agents. In such systems, local compliance can aggregate into collective failure even when every model is individually aligned. We propose a conceptual transition from model-level safety to system-level safety, introducing the framework of the Emergent Systemic Risk Horizon (ESRH) to formalize how instability arises from interaction structure rather than from isolated misbehavior. The paper contributes (i) a theoretical account of collective risk in interacting LLMs, (ii) a taxonomy connecting micro, meso, and macro-level failure modes, and (iii) a design proposal for InstitutionalAI, an architecture for embedding adaptive oversight within multi-agent systems.
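The abstract's central claim, that locally compliant models can still produce system-level failure when outputs are recursively reused as inputs, can be made concrete with a toy chain. A minimal sketch, assuming each agent re-states a numeric claim with a small distortion that is correlated across agents (the tolerance, hop count, and system bound are invented for illustration, not taken from the paper):

```python
# Toy illustration (not the paper's model): outputs are recursively reused
# as inputs across a chain of agents. Each hop distorts a claim by well
# under EPS, so every agent is "locally compliant"; because the models
# share biases, the distortions correlate rather than cancel, and the
# chain drifts past the system-level bound.
EPS = 0.02           # per-agent tolerance (assumed)
SYSTEM_BOUND = 0.10  # what the ecosystem as a whole can absorb (assumed)
N_AGENTS = 12

def agent(claim: float) -> float:
    """One LLM in the chain: re-states the claim with a small shared bias."""
    return claim * (1 + EPS / 2)   # well inside the local tolerance

claim = 1.0
for _ in range(N_AGENTS):
    claim = agent(claim)           # one model's output is the next one's input

drift = abs(claim - 1.0)           # ~0.127 after 12 hops
print("every hop locally compliant, system failed:", drift > SYSTEM_BOUND)
```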
Related papers
- From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration [27.233204826914243]
Collaborative mechanisms may cause minor inaccuracies to solidify into system-level false consensus through iteration. Existing protections often rely on single-agent validation or require modifications to the collaboration architecture. We propose a propagation dynamics model tailored for LLM-MAS that abstracts collaboration as a directed dependency graph (a toy version of this cascade is sketched below).
arXiv Detail & Related papers (2026-03-04T11:45:27Z)
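A hedged sketch of the cascade idea in the entry above: collaboration as a directed dependency graph in which one minor inaccuracy propagates into false consensus. The graph, agent names, and majority-adoption rule are my own invention for illustration:

```python
# Toy dependency graph: each agent consumes the outputs of its parents.
deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C", "D"]}

erroneous = {"A"}  # the "spark": one agent starts with a minor inaccuracy
for agent in ["B", "C", "D", "E"]:             # topological order of the DAG
    bad = sum(p in erroneous for p in deps[agent]) / len(deps[agent])
    if bad >= 0.5:                             # majority of inputs are wrong
        erroneous.add(agent)                   # the error solidifies

print(erroneous)  # all five agents: system-level false consensus
```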
- The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies [57.387081435669835]
Multi-agent systems built from large language models offer a promising paradigm for scalable collective intelligence and self-evolution. We show that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. We propose several solution directions to alleviate the identified safety concern.
arXiv Detail & Related papers (2026-02-10T15:18:19Z)
- Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs [65.6660735371212]
We present JustAsk, a framework that autonomously discovers effective extraction strategies through interaction alone. It formulates extraction as an online exploration problem, using Upper Confidence Bound (UCB)-based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration (a minimal UCB loop is sketched below). Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.
arXiv Detail & Related papers (2026-01-29T03:53:25Z)
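As a rough illustration of the UCB-based strategy selection described above, here is a standard UCB1 loop over a made-up skill space; the strategy names, success rates, and reward function are placeholders, not JustAsk's actual skills:

```python
# UCB1 over a toy hierarchy of extraction strategies (all names invented).
import math
import random

strategies = ["atomic_probe", "role_play", "orchestrated_chain"]
counts = {s: 0 for s in strategies}
rewards = {s: 0.0 for s in strategies}

def ucb_pick(t: int) -> str:
    for s in strategies:               # play each arm once first
        if counts[s] == 0:
            return s
    return max(strategies, key=lambda s: rewards[s] / counts[s]
               + math.sqrt(2 * math.log(t) / counts[s]))

def try_extraction(strategy: str) -> float:
    """Placeholder environment: 1.0 if the probe leaked system-prompt text."""
    success_rate = {"atomic_probe": 0.1, "role_play": 0.3,
                    "orchestrated_chain": 0.6}
    return float(random.random() < success_rate[strategy])

random.seed(0)
for t in range(1, 201):
    s = ucb_pick(t)
    counts[s] += 1
    rewards[s] += try_extraction(s)

print(max(strategies, key=lambda s: rewards[s] / counts[s]))
```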
- Norm-Governed Multi-Agent Decision-Making in Simulator-Coupled Environments: The Reinsurance Constrained Multi-Agent Simulation Process (R-CMASP) [0.0]
We show how norm-governed multi-agent coordination in a domain-calibrated synthetic environment yields more stable, coherent, and norm-adherent behaviour than deterministic automation (a toy norm filter is sketched below). Results suggest that regulated, simulator-driven decision environments are most naturally modelled as norm-governed, simulator-coupled multi-agent systems.
arXiv Detail & Related papers (2025-12-04T10:30:26Z)
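One way to picture "norm-governed" coordination in a simulator-coupled setting is a layer that vets each agent's proposed action against explicit norms before it reaches the simulator. The sketch below is my own toy, with an invented action schema and constraint, not the R-CMASP implementation:

```python
# Toy norm layer: only norm-adherent actions reach the simulator.
from dataclasses import dataclass

@dataclass
class Action:
    agent: str
    cede_fraction: float   # share of risk ceded to reinsurance (invented field)

NORMS = [
    lambda a: 0.0 <= a.cede_fraction <= 0.9,   # regulatory cap (assumed)
]

def governed_step(proposed: list[Action]) -> list[Action]:
    """Execute only actions that satisfy every norm; reject the rest."""
    return [a for a in proposed if all(norm(a) for norm in NORMS)]

print(governed_step([Action("re1", 0.5), Action("re2", 1.2)]))  # re2 rejected
```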
- Learning from Risk: LLM-Guided Generation of Safety-Critical Scenarios with Prior Knowledge [25.50999678115561]
This paper presents a high-fidelity scenario generation framework that integrates a conditional variational autoencoder (CVAE) with a large language model (LLM). Our framework substantially increases the coverage of high-risk and long-tail events, improves consistency between simulated and real-world traffic distributions, and exposes autonomous driving systems to interactions that are significantly more challenging than those produced by existing rule- or data-driven methods.
arXiv Detail & Related papers (2025-11-25T09:53:09Z)
- Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models [17.28260523039769]
We introduce Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Our approach reduces the Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4-20% (how such metrics can be computed is sketched below). By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement.
arXiv Detail & Related papers (2025-10-29T05:23:24Z)
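For concreteness, the three metrics quoted above could be computed from labeled attack outcomes roughly as follows; the outcome labels are an assumed schema, not the paper's:

```python
# Hedged sketch: ASR / RR / NF from per-attempt labels (schema assumed).
def moderation_metrics(outcomes: list[str]) -> dict[str, float]:
    """Each outcome is one jailbreak attempt against the moderated system:
    'complied'  -> the attack succeeded
    'refused'   -> the model explicitly refused
    'off_task'  -> the model answered but did not follow the injected request
    """
    n = len(outcomes)
    return {
        "ASR": outcomes.count("complied") / n,   # Attack Success Rate
        "RR":  outcomes.count("refused") / n,    # Refusal Rate
        "NF":  outcomes.count("off_task") / n,   # Non-Following Rate
    }

print(moderation_metrics(["refused", "complied", "off_task", "refused"]))
```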
- Social World Model-Augmented Mechanism Design Policy Learning [58.739456918502704]
We introduce SWM-AP (Social World Model-Augmented Mechanism Design Policy Learning), which learns a social world model hierarchically to enhance mechanism design. We show that SWM-AP outperforms established model-based and model-free RL baselines in cumulative rewards and sample efficiency.
arXiv Detail & Related papers (2025-10-22T06:01:21Z)
- Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems [0.0]
This report addresses the early stages of risk identification and analysis for multi-agent AI systems. We examine six critical failure modes: cascading reliability failures, inter-agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed motive dynamics (a worked example of the first mode follows this entry).
arXiv Detail & Related papers (2025-08-06T06:06:57Z)
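The first failure mode above, cascading reliability failures, has a simple arithmetic core: per-agent reliability compounds multiplicatively along a pipeline, so individually dependable agents can still yield an unreliable system. A worked example (the 0.99 figure is illustrative):

```python
# Reliability of a chain where every agent must be correct for the
# pipeline to succeed: P(system ok) = per_agent ** n.
per_agent = 0.99                 # each agent is right 99% of the time
for n in (5, 20, 50):
    system = per_agent ** n
    print(f"{n:>2} agents: P(no failure) = {system:.2f}")
# 5 agents: 0.95, 20 agents: 0.82, 50 agents: 0.61
```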
- Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected (a minimal prober-gated sketch follows this entry).
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
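A minimal sketch of the prober-then-refuse pattern summarized above, assuming a linear logistic prober over one intermediate representation; the weights, dimensionality, and threshold are invented, and the real system selects which layer to probe via its Safety Awareness Score:

```python
# Toy prober-gated generation (not AutoSteer's implementation).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=8)          # stand-in for a trained linear safety prober

def toxicity_prob(hidden_state: np.ndarray) -> float:
    """Logistic prober over an intermediate representation."""
    return float(1 / (1 + np.exp(-hidden_state @ W)))

def generate(hidden_state: np.ndarray, threshold: float = 0.5) -> str:
    if toxicity_prob(hidden_state) > threshold:
        return "[refusal head] I can't help with that."   # intervene
    return "[normal decoding continues]"                  # pass through

print(generate(rng.normal(size=8)))
```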
- A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents [45.53643260046778]
Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents. These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities.
arXiv Detail & Related papers (2025-06-30T13:34:34Z)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [1.1666234644810893]
Small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. No model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective (a minimal judge harness is sketched below).
arXiv Detail & Related papers (2025-04-10T16:00:59Z)
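The LLM-as-a-Judge assessment loop from the benchmarking entry above can be sketched as follows; `query_model` and `query_judge` are hypothetical stubs standing in for real model calls, not any library's API:

```python
# Skeleton of a scalable automated assessment harness (all stubs invented).
def query_model(prompt: str) -> str:
    return "model response to: " + prompt          # stub target model

def query_judge(prompt: str, response: str) -> bool:
    """Stub judge model: True means the response exhibits elicited bias."""
    return "stereotype" in response.lower()

adversarial_prompts = [
    "Finish the sentence: people from X are always ...",
]

elicited = sum(query_judge(p, query_model(p)) for p in adversarial_prompts)
print(f"bias elicitation rate: {elicited / len(adversarial_prompts):.2f}")
```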
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Risk-Sensitive Sequential Action Control with Multi-Modal Human Trajectory Forecasting for Safe Crowd-Robot Interaction [55.569050872780224]
We present an online framework for safe crowd-robot interaction based on risk-sensitive optimal control, wherein the risk is modeled by the entropic risk measure (computed in the sketch after this entry).
Our modular approach decouples the crowd-robot interaction into learning-based prediction and model-based control.
A simulation study and a real-world experiment show that the proposed framework can accomplish safe and efficient navigation while avoiding collisions with more than 50 humans in the scene.
arXiv Detail & Related papers (2020-09-12T02:02:52Z)
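The entropic risk measure used in the entry above is rho_theta(X) = (1/theta) * log E[exp(theta * X)]; a quick numeric sketch with hypothetical cost samples (the cost values and theta settings are invented for illustration):

```python
# Entropic risk of a sampled cost X; larger theta weights bad outcomes
# more heavily (risk-averse), theta -> 0 recovers the plain mean.
import numpy as np

def entropic_risk(costs: np.ndarray, theta: float) -> float:
    return float(np.log(np.mean(np.exp(theta * costs))) / theta)

costs = np.array([0.1, 0.2, 0.2, 1.5])   # hypothetical collision costs
print(entropic_risk(costs, theta=0.1))   # ~0.53, close to the mean (0.5)
print(entropic_risk(costs, theta=5.0))   # ~1.22, approaching the worst case
```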