The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs
- URL: http://arxiv.org/abs/2601.04199v1
- Date: Fri, 05 Dec 2025 06:52:06 GMT
- Title: The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs
- Authors: Jiale Zhao, Xing Mou, Jinlin Wu, Hongyuan Yu, Mingrui Sun, Yang Shi, Xuanwu Yin, Zhen Chen, Zhen Lei, Yaohua Wang
- Abstract summary: Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks. However, research into their safety has lagged, posing potential risks for real-world deployment. We first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs.
- Score: 23.79442915729949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model's original safety alignment. To address this challenge, we propose a novel "Parameter-Space Intervention" approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.
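To make the parameter-space idea concrete, the following is a minimal, hypothetical sketch of how a "safety delta" might be extracted from a safety-aligned base model and grafted into a medically fine-tuned model, with a coarse scalar search standing in for the paper's fine-grained parameter search. This is not the authors' released code; the function names, the single global grafting coefficient, and the retention threshold are all illustrative assumptions.

```python
# Hypothetical sketch of parameter-space safety grafting (illustrative only;
# names and the exact procedure are assumptions, not the paper's method).
import copy
import torch

def extract_safety_delta(aligned_model, unaligned_model):
    """Safety 'task vector': parameter-wise difference between a safety-aligned
    base model and its pre-alignment counterpart (assumes matching parameter names)."""
    unaligned = dict(unaligned_model.named_parameters())
    return {name: p.detach() - unaligned[name].detach()
            for name, p in aligned_model.named_parameters()}

def graft_safety(medical_model, safety_delta, alpha):
    """Inject the safety delta into the medically fine-tuned model,
    scaled by a grafting coefficient alpha."""
    grafted = copy.deepcopy(medical_model)
    with torch.no_grad():
        for name, p in grafted.named_parameters():
            if name in safety_delta:
                p.add_(alpha * safety_delta[name])
    return grafted

def search_alpha(medical_model, safety_delta, eval_safety, eval_medical,
                 alphas=(0.0, 0.25, 0.5, 0.75, 1.0), min_medical=0.95):
    """Coarse stand-in for the paper's fine-grained search: keep the largest
    safety gain whose medical score stays above a retention threshold."""
    baseline = eval_medical(medical_model)
    best, best_safety = medical_model, eval_safety(medical_model)
    for a in alphas:
        candidate = graft_safety(medical_model, safety_delta, a)
        if eval_medical(candidate) >= min_medical * baseline:
            s = eval_safety(candidate)
            if s > best_safety:
                best, best_safety = candidate, s
    return best
```

Note that this sketch applies one global coefficient to all parameters, whereas the abstract describes a fine-grained parameter search, so the trade-off in the paper is presumably resolved at a finer granularity than a single scalar.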
Related papers
- SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond [134.43113804188195]
We introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement.
arXiv Detail & Related papers (2026-03-02T08:16:04Z) - A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties [11.500745861209774]
Medical Large Language Models (LLMs) are increasingly deployed for clinical decision support across diverse specialties. Existing security benchmarks require GPU clusters, commercial API access, or protected health data. We propose a practical, fully reproducible framework for evaluating medical AI security under realistic resource constraints.
arXiv Detail & Related papers (2025-12-09T02:28:15Z) - Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [48.096652370210016]
We introduce a safety evaluation protocol tailored to the medical domain, covering both patient and clinician user perspectives. This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming from three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z) - Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations [13.977100716044104]
We propose a novel inference-time defense strategy to mitigate harmful queries. We show that our strategy enhances model safety without significantly compromising performance. We then introduce a mixed demonstration strategy as a trade-off solution for balancing security and performance.
arXiv Detail & Related papers (2025-06-08T16:26:51Z) - Saffron-1: Safety Inference Scaling [69.61130284742353]
SAFFRON is a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. We publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M).
arXiv Detail & Related papers (2025-06-06T18:05:45Z) - Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z) - Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare [15.438265972219869]
Large language models (LLMs) are increasingly utilized in healthcare applications. This study systematically assesses the vulnerabilities of seven LLMs to three advanced black-box jailbreaking techniques.
arXiv Detail & Related papers (2025-01-27T22:07:52Z) - EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - Adversarial Attacks on Large Language Models in Medicine [34.17895005922139]
The integration of Large Language Models into healthcare applications offers promising advancements in medical diagnostics, treatment recommendations, and patient care. The susceptibility of LLMs to adversarial attacks poses a significant threat, potentially leading to harmful outcomes in delicate medical contexts. This study investigates the vulnerability of LLMs to two types of adversarial attacks in three medical tasks.
arXiv Detail & Related papers (2024-06-18T04:24:30Z) - Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models [9.860799633304298]
This paper delves into the underexplored security vulnerabilities of MedMLLMs.
By combining existing clinical medical data with atypical natural phenomena, we define the mismatched malicious attack.
We propose the MCM optimization method, which significantly enhances the attack success rate on MedMLLMs.
arXiv Detail & Related papers (2024-05-26T19:11:21Z) - Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)