CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs
- URL: http://arxiv.org/abs/2602.03263v1
- Date: Tue, 03 Feb 2026 08:49:44 GMT
- Title: CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs
- Authors: Yuxuan Liu, Yuntian Shi, Kun Wang, Haoting Shen, Kun Yang,
- Abstract summary: Multimodal large language models (MLLMs) enable interaction over both text and images. This paper introduces CSR-Bench, a benchmark for evaluating cross-modal reliability. We evaluate 16 state-of-the-art MLLMs and observe systematic cross-modal alignment gaps.
- Score: 10.42126976065225
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) enable interaction over both text and images, but their safety behavior can be driven by unimodal shortcuts instead of true joint intent understanding. We introduce CSR-Bench, a benchmark for evaluating cross-modal reliability through four stress-testing interaction patterns spanning Safety, Over-rejection, Bias, and Hallucination, covering 61 fine-grained types. Each instance is constructed to require integrated image-text interpretation, and we additionally provide paired text-only controls to diagnose modality-induced behavior shifts. We evaluate 16 state-of-the-art MLLMs and observe systematic cross-modal alignment gaps. Models show weak safety awareness, strong language dominance under interference, and consistent performance degradation from text-only controls to multimodal inputs. We also observe a clear trade-off between reducing over-rejection and maintaining safe, non-discriminatory behavior, suggesting that some apparent safety gains may come from refusal-oriented heuristics rather than robust intent understanding. WARNING: This paper contains unsafe content.
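The paired-control protocol lends itself to a simple evaluation harness. The Python sketch below shows one way to measure the modality-induced behavior shift the abstract describes: each multimodal instance is scored against its text-only control, and verdict flips are counted. The `Instance` fields, the `judge` callable, and the `model.generate` signature are illustrative assumptions, not CSR-Bench's released interface.

```python
# Minimal sketch of the paired-control protocol, assuming a hypothetical
# `model.generate` API and a binary safety `judge`; not CSR-Bench's code.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Instance:
    image_path: str            # image half of the multimodal instance
    prompt: str                # text half; intent requires both modalities
    text_only_control: str     # paired control expressing the same intent

def alignment_gap(model, judge: Callable[[str], bool],
                  instances: Iterable[Instance]) -> float:
    """Fraction of instances whose safety verdict flips between the
    multimodal input and its text-only control."""
    instances = list(instances)
    flips = 0
    for ex in instances:
        mm_safe = judge(model.generate(ex.prompt, image=ex.image_path))
        txt_safe = judge(model.generate(ex.text_only_control))
        flips += int(mm_safe != txt_safe)
    return flips / len(instances)
```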
Related papers
- MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues [39.24594135913578]
We introduce the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench), a benchmark of realistic images and multi-turn conversations. MTMCS-Bench offers paired safe and unsafe dialogues with structured evaluation. We observe persistent trade-offs between contextual safety and utility, with models tending to either miss gradual risks or over-refuse benign dialogues.
arXiv Detail & Related papers (2026-01-11T03:10:56Z)
- LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LatentGuard is a framework that combines behavioral alignment with supervised latent space control for interpretable and precise safety steering. Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z)
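The summary does not spell out the steering mechanism, but latent steering of this kind is commonly implemented by shifting one layer's hidden states along a learned direction at inference time. A minimal PyTorch sketch under that assumption; `v_refuse`, `alpha`, and the layer index are illustrative, and LatentGuard's supervised control is richer than this.

```python
# Hedged sketch of latent steering: nudge one transformer layer's hidden
# states along a learned refusal direction. `v_refuse`, `alpha`, and the
# layer choice are assumptions, not LatentGuard's published method.
import torch

def make_steering_hook(v_refuse: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v_refuse  # broadcasts over (batch, seq)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage (hypothetical HuggingFace-style decoder):
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(v_refuse, alpha=4.0))
# ... generate ...
# handle.remove()
```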
- When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models [50.66979825532277]
We introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks.
arXiv Detail & Related papers (2025-09-15T15:40:58Z)
- Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM [40.83149588857177]
Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. We introduce security tensors: trainable input vectors applied during inference through either the textual or visual modality.
arXiv Detail & Related papers (2025-07-28T16:59:53Z)
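Read literally, "trainable input vectors" suggests soft-prompt-style embeddings prepended to the frozen model's input sequence. A minimal PyTorch sketch under that assumption; the shapes and any training objective are not from the paper.

```python
# Hedged sketch: "security tensors" modeled as trainable soft-prompt
# vectors prepended to a frozen LVLM's input embeddings. Shapes and the
# initialization scale are assumptions, not the paper's formulation.
import torch
import torch.nn as nn

class SecurityTensor(nn.Module):
    def __init__(self, n_vectors: int, d_model: int):
        super().__init__()
        self.vectors = nn.Parameter(torch.randn(n_vectors, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model); prepend the vectors.
        batch = input_embeds.size(0)
        prefix = self.vectors.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
```

Only the `SecurityTensor` parameters would be optimized; the underlying LVLM stays frozen, which matches the paper's framing of extending text-aligned safety without retraining the model itself.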
- Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
arXiv Detail & Related papers (2025-07-17T16:04:55Z)
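Component (2) can be sketched as a linear probe over pooled intermediate representations that gates a refusal. In the Python sketch below, the probe architecture, the pooling, the layer choice, and the refusal string are illustrative stand-ins for the SAS-selected layer and the learned Refusal Head.

```python
# Hedged sketch of an inference-time safety prober gating generation,
# in the spirit of AutoSteer but not its published implementation.
import torch
import torch.nn as nn

class SafetyProber(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.probe = nn.Linear(d_model, 1)

    def risk(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model); mean-pool tokens, score risk.
        return torch.sigmoid(self.probe(hidden.mean(dim=1))).squeeze(-1)

@torch.no_grad()
def guarded_generate(model, prober, inputs, threshold: float = 0.5):
    out = model(**inputs, output_hidden_states=True)
    if prober.risk(out.hidden_states[-1]).item() > threshold:
        return "I can't help with that."  # stand-in for the Refusal Head
    return model.generate(**inputs)
```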
- ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications. We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z)
- Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework. We evaluate the undesirable context generation of MLLMs like LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP. This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z)
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning [43.209846711845536]
Current alignment strategies rely on supervised safety fine-tuning with curated datasets. We show that supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, and that machine unlearning (MU) is a powerful alternative to supervised safety fine-tuning.
arXiv Detail & Related papers (2025-03-14T19:52:08Z)
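The summary does not say which MU method the paper uses; one common baseline is gradient ascent on a "forget" set balanced by ordinary descent on a "retain" set. A minimal Python sketch under that assumption, with HuggingFace-style batches and loss as hypothetical placeholders:

```python
# Hedged sketch of one generic machine-unlearning step: gradient ascent
# on a forget batch (spurious pattern -> response pairs) balanced by
# descent on a retain batch. The batch format, `alpha`, and the `.loss`
# interface are assumptions, not the paper's recipe.
import torch

def unlearn_step(model, optimizer, forget_batch, retain_batch,
                 alpha: float = 1.0):
    optimizer.zero_grad()
    forget_loss = -model(**forget_batch).loss  # ascend: erase the mapping
    retain_loss = model(**retain_batch).loss   # descend: keep utility
    (alpha * forget_loss + retain_loss).backward()
    optimizer.step()
```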
- Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. To empirically investigate this problem, we developed the SIUO benchmark, a cross-modality benchmark encompassing 9 critical safety domains such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)