LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
- URL: http://arxiv.org/abs/2509.25896v2
- Date: Wed, 01 Oct 2025 21:07:48 GMT
- Title: LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
- Authors: Guolei Huang, Qinzhi Peng, Gan Xu, Yuxuan Lu, Yongjun Shen,
- Abstract summary: Malicious intent can be spread across turns and images in Multimodal Multi-Turn (MMT) dialogues. We present the first systematic definition and study of MMT dialogue safety. We develop an automated multimodal multi-turn red-teaming framework to generate unsafe multi-turn dialogues for MMDS. We present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses.
- Score: 1.4923957493548121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Vision-Language Models (VLMs) move into interactive, multi-turn use, new safety risks arise that single-turn or single-modality moderation misses. In Multimodal Multi-Turn (MMT) dialogues, malicious intent can be spread across turns and images, while context-sensitive replies may still advance harmful content. To address this challenge, we present the first systematic definition and study of MMT dialogue safety. Building on this formulation, we introduce the Multimodal Multi-turn Dialogue Safety (MMDS) dataset. We further develop an automated multimodal multi-turn red-teaming framework based on Monte Carlo Tree Search (MCTS) to generate unsafe multimodal multi-turn dialogues for MMDS. MMDS contains 4,484 annotated multimodal dialogue samples with fine-grained safety ratings, policy dimension labels, and evidence-based rationales for both users and assistants. Leveraging MMDS, we present LLaVAShield, a powerful tool that jointly detects and assesses risk in user inputs and assistant responses. Across comprehensive experiments, LLaVAShield consistently outperforms strong baselines on MMT content moderation tasks and under dynamic policy configurations, establishing new state-of-the-art results. We will publicly release the dataset and model to support future research.
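The abstract describes an MCTS-based red-teaming framework but does not publish its algorithmic details. The following is a minimal sketch of how an MCTS loop over dialogue turns could look; `candidate_turns` and `risk_score` are hypothetical stand-ins for an attacker model's turn proposals and a safety judge's rating, and the `Node`/`mcts_red_team` structure is a generic textbook MCTS, not the paper's actual implementation.

```python
import math
import random

def candidate_turns(dialogue):
    """Hypothetical generator: propose candidate next-turn probes."""
    return [f"probe-{len(dialogue)}-{i}" for i in range(3)]

def risk_score(dialogue):
    """Hypothetical judge: toy score that grows with dialogue depth."""
    return min(1.0, 0.2 * len(dialogue)) + random.random() * 0.05

class Node:
    def __init__(self, dialogue, parent=None):
        self.dialogue, self.parent = dialogue, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb(self, c=1.4):
        # Unvisited children are explored first.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts_red_team(max_turns=4, iterations=50):
    root = Node([])
    for _ in range(iterations):
        node = root
        # Selection: descend via UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: extend the dialogue if the turn budget allows.
        if len(node.dialogue) < max_turns:
            node.children = [Node(node.dialogue + [t], node)
                             for t in candidate_turns(node.dialogue)]
            node = random.choice(node.children)
        # Evaluation + backpropagation of the judged risk.
        reward = risk_score(node.dialogue)
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited trajectory as the strongest found attack.
    best = root
    while best.children:
        best = max(best.children, key=lambda n: n.visits)
    return best.dialogue
```

In practice the generator and judge would be the attacker VLM and a moderation model, and the returned trajectory would be one unsafe multi-turn dialogue candidate for the MMDS dataset.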
Related papers
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z) - MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues [39.24594135913578]
We introduce the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench), a benchmark of realistic images and multi-turn conversations. MTMCS-Bench offers paired safe and unsafe dialogues with structured evaluation. We observe persistent trade-offs between contextual safety and utility, with models tending to either miss gradual risks or over-refuse benign dialogues.
arXiv Detail & Related papers (2026-01-11T03:10:56Z) - AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs [30.026306656765314]
We present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. We propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a decrease of more than 10% in Attack Success Rate.
arXiv Detail & Related papers (2026-01-08T08:57:05Z) - OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models [54.80460603255789]
We introduce OutSafe-Bench, the first comprehensive content safety evaluation suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips, and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce the Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories.
arXiv Detail & Related papers (2025-11-13T13:18:27Z) - SafeMT: Multi-turn Safety for Multimodal Language Models [42.59582247058264]
We introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies.
arXiv Detail & Related papers (2025-10-14T04:24:07Z) - Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
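The prober-plus-refusal-head design above can be illustrated with a minimal inference-time gate. This is a sketch under stated assumptions, not AutoSteer's implementation: `hidden` stands in for an intermediate-layer activation (which the paper selects via its Safety Awareness Score), and the prober here is a toy logistic model with made-up weights rather than a trained module.

```python
import math

W = [0.5] * 8   # hypothetical prober weights over an 8-dim activation
B = -0.5        # hypothetical prober bias

def toxicity_prob(hidden):
    """Toy logistic prober over an intermediate-layer activation vector."""
    z = sum(w * h for w, h in zip(W, hidden)) + B
    return 1.0 / (1.0 + math.exp(-z))

def generate_with_gate(hidden, decode, threshold=0.5):
    """Selectively intervene: refuse when estimated risk crosses threshold."""
    if toxicity_prob(hidden) >= threshold:
        return "I can't help with that."  # refusal-head style intervention
    return decode(hidden)                 # otherwise, normal decoding path

# Usage with stand-in activations and a dummy decoder:
print(generate_with_gate([-1.0] * 8, decode=lambda h: "normal continuation"))
```

The design choice worth noting is that the gate sits entirely at inference time: the base model's weights are untouched, and safety behavior is controlled by the prober threshold.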
arXiv Detail & Related papers (2025-07-17T16:04:55Z) - ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs [72.8646625127485]
Multimodal implicit toxicity appears not only as formal statements on social platforms but also as prompts that can lead to toxic dialogues. Despite successes in unimodal text or image moderation, toxicity detection for multimodal content, particularly multimodal implicit toxicity, remains underexplored. To advance the detection of multimodal implicit toxicity, we build ShieldVLM, a model which identifies implicit toxicity in multimodal statements, prompts, and dialogues via deliberative cross-modal reasoning.
arXiv Detail & Related papers (2025-05-20T07:31:17Z) - IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework for robust multi-modal object re-identification (ReID). Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - CUE-M: Contextual Understanding and Enhanced Search with Multimodal Large Language Model [9.224965304457708]
This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework. It incorporates image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. Experiments on real-world datasets and public benchmarks for knowledge-based VQA and safety demonstrate that CUE-M outperforms baselines and establishes new state-of-the-art results.
arXiv Detail & Related papers (2024-11-19T07:16:48Z) - Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD).
We develop SAM with semantic feature fusion guidance (Sammese).
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information. Specifically, in the mask decoder, a semantic-geometric
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework based on the unified vision-language model (VLM).
arXiv Detail & Related papers (2024-03-17T19:12:26Z) - Large AI Model Empowered Multimodal Semantic Communications [48.73159237649128]
We propose a Large AI Model-based Multimodal SC (LAMMSC) framework.
We first present the Conditional-based Multimodal Alignment (MMA) that enables the transformation between multimodal and unimodal data.
Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows users to perform personalized semantic extraction or recovery.
Finally, we apply Conditional Generative Adversarial Network-based channel Estimation (CGE) to estimate the wireless channel state information.
arXiv Detail & Related papers (2023-09-03T19:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.