Related papers: MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

URL: http://arxiv.org/abs/2505.20824v1
Date: Tue, 27 May 2025 07:34:40 GMT
Title: MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems
Authors: Kai Chen, Taihang Zhen, Hewei Wang, Kailai Liu, Xinfeng Li, Jing Huo, Tianpei Yang, Jinfeng Xu, Wei Dong, Yang Gao,
Abstract summary: We introduce MedSentry, a benchmark 5 000 adversarial medical prompts spanning 25 categories with 100 subthemes.<n>We develop an end-to-end attack-defense evaluation pipeline to analyze how four representative multi-agent topologies withstand attacks from 'dark-personality' agents.
Score: 24.60202452646343
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi-agent configurations, is paramount. In this paper we introduce MedSentry, a benchmark comprising 5 000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end-to-end attack-defense evaluation pipeline to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from 'dark-personality' agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For instance, SharedPool's open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit greater resilience thanks to inherent redundancy and isolation. To mitigate these risks, we propose a personality-scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring system safety to near-baseline levels. MedSentry thus furnishes both a rigorous evaluation framework and practical defense strategies that guide the design of safer LLM-based multi-agent systems in medical domains.

Related papers

OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup.<n>We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z)
INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems [70.37731999972785]
In this paper, we propose Infection-Aware Guard, INFA-Guard, a novel defense framework that explicitly identifies and addresses infected agents as a distinct threat category.<n>During remediation, INFA-Guard replaces attackers and rehabilitates infected ones, avoiding malicious propagation while preserving topological integrity.
arXiv Detail & Related papers (2026-01-21T05:27:08Z)
Data Poisoning Vulnerabilities Across Healthcare AI Architectures: A Security Threat Analysis [39.89241412792336]
We analyzed eight attack scenarios in four categories: architectural attacks on convolutional neural networks, large language models, and reinforcement learning agents.<n>Our findings indicate that attackers with access to only 100-500 samples can compromise healthcare AI regardless of dataset size.<n>We recommend multilayer defenses including required adversarial testing, ensemble-based detection, privacy-preserving security mechanisms, and international coordination on AI security standards.
arXiv Detail & Related papers (2025-11-14T07:16:16Z)
Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting [5.544819942438653]
We present SafeAgents, a framework for fine-grained security assessment of multi-agent systems.<n>We conduct a study across five widely adopted multi-agent architectures.<n>Our findings reveal that common design patterns carry significant vulnerabilities.
arXiv Detail & Related papers (2025-11-14T04:22:49Z)
Can an Individual Manipulate the Collective Decisions of Multi-Agents? [53.01767232004823]
M-Spoiler is a framework that simulates agent interactions within a multi-agent system to generate adversarial samples.<n>M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples.<n>Our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems.
arXiv Detail & Related papers (2025-09-20T01:54:20Z)
Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning [49.31650627835956]
Partial agent failure becomes inevitable when systems scale up, making it crucial to identify the subset of agents whose compromise would most severely degrade overall performance.<n>In this paper, we study this Vulnerable Agent Identification (VAI) problem in large-scale multi-agent reinforcement learning (MARL)<n> Experiments show our method effectively identifies more vulnerable agents in large-scale MARL and the rule-based system, fooling system into worse failures, and learning a value function that reveals the vulnerability of each agent.
arXiv Detail & Related papers (2025-09-18T16:03:50Z)
How to make Medical AI Systems safer? Simulating Vulnerabilities, and Threats in Multimodal Medical RAG System [21.40560864239872]
We propose MedThreatRAG, a novel framework that probes vulnerabilities in medical RAG systems.<n>A key innovation of our approach is the construction of a simulated semi-open attack environment.<n>We show that MedThreatRAG reduces answer F1 scores by up to 27.66% and lowers LLaVA-Med-1.5 F1 rates to as low as 51.36%.
arXiv Detail & Related papers (2025-08-24T05:11:09Z)
Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems [0.0]
This report addresses the early stages of risk identification and analysis for multi-agent AI systems.<n>We examine six critical failure modes: cascading reliability failures, inter-agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed motive dynamics.
arXiv Detail & Related papers (2025-08-06T06:06:57Z)
Towards Unifying Quantitative Security Benchmarking for Multi Agent Systems [0.0]
Evolving AI systems increasingly deploy multi-agent architectures where autonomous agents collaborate, share information, and delegate tasks through developing protocols.<n>One such risk is a cascading risk: a breach in one agent can cascade through the system, compromising others by exploiting inter-agent trust.<n>In an ACI attack, a malicious input or tool exploit injected at one agent leads to cascading compromises and amplified downstream effects across agents that trust its outputs.
arXiv Detail & Related papers (2025-07-23T13:51:28Z)
Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems [15.843105510334388]
We investigate intention-hiding threats in multi-agent systems powered by Large Language Models (LLM-MAS)<n>We propose a psychology-based detection framework AgentXposed, which combines the HEXACO personality model with the Reid Technique.<n>Our findings reveal the structural and behavioral risks posed by intention-hiding attacks and offer valuable insights into securing LLM-based multi-agent systems.
arXiv Detail & Related papers (2025-07-07T07:34:34Z)
SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents [58.21223208538351]
This work explores the security issues surrounding mobile multimodal agents.<n>It attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information.<n>It also designs an automated assisted assessment scheme based on a large language model.
arXiv Detail & Related papers (2025-07-01T15:10:00Z)
LLM Agents Should Employ Security Principles [60.03651084139836]
This paper argues that the well-established design principles in information security should be employed when deploying Large Language Model (LLM) agents at scale.<n>We introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle.
arXiv Detail & Related papers (2025-05-29T21:39:08Z)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.<n>Certain scenarios suffer 25 times higher attack rates.<n>Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
Poisoned-MRAG: Knowledge Poisoning Attacks to Multimodal Retrieval Augmented Generation [71.32665836294103]
Multimodal retrieval-augmented generation (RAG) enhances the visual reasoning capability of vision-language models (VLMs)<n>In this work, we introduce textitPoisoned-MRAG, the first knowledge poisoning attack on multimodal RAG systems.
arXiv Detail & Related papers (2025-03-08T15:46:38Z)
Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems [1.2564343689544843]
We develop simulations of AI agents collaborating on shared objectives to study security risks and trade-offs.<n>We observe infectious malicious prompts - the multi-hop spreading of malicious instructions.<n>Our findings illustrate potential trade-off between security and collaborative efficiency in multi-agent systems.
arXiv Detail & Related papers (2025-02-26T14:00:35Z)
Multi-Agent Risks from Advanced AI [90.74347101431474]
Multi-agent systems of advanced AI pose novel and under-explored risks.<n>We identify three key failure modes based on agents' incentives, as well as seven key risk factors.<n>We highlight several important instances of each risk, as well as promising directions to help mitigate them.
arXiv Detail & Related papers (2025-02-19T23:03:21Z)
EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction.<n>Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results.<n>However, the deployment of these agents in physical environments presents significant safety challenges.<n>This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z)
Adversarial Attacks on Large Language Models in Medicine [34.17895005922139]
The integration of Large Language Models into healthcare applications offers promising advancements in medical diagnostics, treatment recommendations, and patient care.<n>The susceptibility of LLMs to adversarial attacks poses a significant threat, potentially leading to harmful outcomes in delicate medical contexts.<n>This study investigates the vulnerability of LLMs to two types of adversarial attacks in three medical tasks.
arXiv Detail & Related papers (2024-06-18T04:24:30Z)
Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models [9.860799633304298]
This paper delves into the underexplored security vulnerabilities of MedMLLMs. By combining existing clinical medical data with atypical natural phenomena, we define the mismatched malicious attack. We propose the MCM optimization method, which significantly enhances the attack success rate on MedMLLMs.
arXiv Detail & Related papers (2024-05-26T19:11:21Z)
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety [70.84902425123406]
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for malicious purposes presents significant risks. We propose a framework (PsySafe) grounded in agent psychology, focusing on identifying how dark personality traits in agents can lead to risky behaviors. Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors.
arXiv Detail & Related papers (2024-01-22T12:11:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.