AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs
- URL: http://arxiv.org/abs/2601.04736v1
- Date: Thu, 08 Jan 2026 08:57:05 GMT
- Title: AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs
- Authors: Han Zhu, Jiale Chen, Chengkun Cai, Shengjie Sun, Haoran Li, Yujin Zhou, Chi-Min Chan, Pengcheng Wen, Lei Li, Sirui Han, Yike Guo
- Abstract summary: We present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. We propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than a 10% decrease in Attack Success Rate.
- Score: 30.026306656765314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns and safety guardrails weaken as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answering (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than a 10\% decrease in Attack Success Rate (ASR), together with an improvement of at least 8\% on the harmlessness dimension and over 13\% on the helpfulness dimension on multi-modal multi-turn safety benchmarks, while preserving the models' general abilities.
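The abstract's two key ingredients, GRPO-style group-relative advantages and a turn-aware dual-objective reward over a whole dialogue, can be sketched as follows. This is not the paper's implementation: the exponential turn weighting, the 50/50 safety/helpfulness mix, and the `gamma` parameter are illustrative assumptions; only the group-normalization step follows the standard GRPO recipe.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled response's reward is normalized
    against the group's mean and standard deviation (no value network)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def turn_aware_dialogue_reward(safety_scores, helpful_scores, gamma=0.9):
    """Hypothetical turn-aware dual-objective reward across an entire dialogue.
    Later turns are up-weighted (weight gamma**(n-1-t)), reflecting that
    multi-turn attacks tend to succeed as the conversation progresses.
    The 0.5/0.5 mix of safety and helpfulness is an assumption."""
    n = len(safety_scores)
    weights = [gamma ** (n - 1 - t) for t in range(n)]  # grows with turn index t
    total = sum(w * (0.5 * s + 0.5 * h)
                for w, s, h in zip(weights, safety_scores, helpful_scores))
    return total / sum(weights)
```

In a GRPO loop, one would sample a group of dialogue rollouts per prompt, score each with `turn_aware_dialogue_reward`, and feed the resulting list through `group_relative_advantages` to weight the policy-gradient update.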
Related papers
- MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models [17.848889547838173]
MUSE (Multimodal Unified Safety Evaluation) is an open-source, run-centric platform that integrates automatic cross-modal payload generation. A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance). Experiments show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal.
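The hard/soft ASR split described above reduces to a simple counting rule. A minimal sketch, assuming a per-attack verdict label set of `"compliance"`, `"partial_compliance"`, and `"refusal"` (the label names are assumptions, not MUSE's actual schema):

```python
def attack_success_rates(verdicts):
    """Hard ASR counts only full Compliance as a successful attack;
    soft ASR additionally counts Partial Compliance. Returns (hard, soft)
    as fractions of the total number of attack attempts."""
    n = len(verdicts)
    hard = sum(v == "compliance" for v in verdicts) / n
    soft = sum(v in ("compliance", "partial_compliance") for v in verdicts) / n
    return hard, soft
```

By construction soft ASR is always greater than or equal to hard ASR, so a large gap between the two signals that a model often hedges rather than cleanly refusing.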
arXiv Detail & Related papers (2026-03-03T00:10:23Z) - SafeMT: Multi-turn Safety for Multimodal Language Models [42.59582247058264]
We introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies.
arXiv Detail & Related papers (2025-10-14T04:24:07Z) - LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models [1.4923957493548121]
Malicious intent can be spread across turns and images in Multimodal Multi-Turn (MMT) dialogues. We present the first systematic definition and study of MMT dialogue safety. We develop an automated multimodal multi-turn red-teaming framework to generate unsafe multi-turn dialogues for MMDS. We present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses.
arXiv Detail & Related papers (2025-09-30T07:42:23Z) - When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models [50.66979825532277]
We introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks.
arXiv Detail & Related papers (2025-09-15T15:40:58Z) - Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z) - Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP. This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z) - SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues [9.762621950740995]
Malicious attackers can exploit large language models (LLMs) by engaging them in multi-turn dialogues. We propose a novel defense mechanism: SafeTy Reasoning Elicitation Alignment for Multi-Turn Dialogues (STREAM).
arXiv Detail & Related papers (2025-05-31T18:38:23Z) - Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation [88.78166077081912]
We introduce a multimodal unlearning benchmark, UnLOK-VQA, and an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states.
arXiv Detail & Related papers (2025-05-01T01:54:00Z) - Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose fine-tuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality [52.49062565901046]
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models with human values. Existing approaches struggle to capture the multi-dimensional, distributional nuances of human preferences. We introduce Utility-Conditioned Multi-Objective Alignment (UC-MOA), a novel framework that overcomes these limitations.
arXiv Detail & Related papers (2025-03-10T09:52:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.