Related papers: Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model

Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model

URL: http://arxiv.org/abs/2406.15279v2
Date: Mon, 17 Feb 2025 03:38:42 GMT
Title: Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model
Authors: Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang,
Abstract summary: We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment.<n>To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.<n>Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
Score: 73.8765529028288
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.

Related papers

SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization [79.14563283347773]
Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities.<n>Cross-modal couplings can produce unsafe semantics even when individual inputs are benign.<n>We propose SafeGRPO, a self-rewarded multimodal safety alignment framework.
arXiv Detail & Related papers (2025-11-17T05:09:49Z)
DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents [12.054307827384415]
Large Language Models (LLMs) have become increasingly prominent, severely constraining their trustworthy deployment in critical domains.<n>This paper proposes a novel safety response framework designed to safeguard LLMs at both the input and output levels.
arXiv Detail & Related papers (2025-11-05T03:04:35Z)
SafeMind: Benchmarking and Mitigating Safety Risks in Embodied LLM Agents [7.975014390527644]
Embodied agents powered by large language models (LLMs) inherit advanced planning capabilities; however, their direct interaction with the physical world exposes them to safety vulnerabilities.<n>We present SafeMindBench, a multimodal benchmark with 5,558 samples spanning four task categories (Instr-Risk, Env-Risk, Order-Fix, Req-Align) across high-risk scenarios such as sabotage, harm, privacy, and illegal behavior.<n>We introduce SafeMindAgent, a modular Planner-Executor architecture integrated with three cascaded safety modules, which incorporate safety constraints into the reasoning process.
arXiv Detail & Related papers (2025-09-30T07:24:04Z)
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention [53.25106308403173]
We show that existing methods overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible for and exploited by malicious users.<n>We propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals.
arXiv Detail & Related papers (2025-09-29T07:41:09Z)
When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models [50.66979825532277]
We introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for a cross-modal challenge.<n>A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset.<n> Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks.
arXiv Detail & Related papers (2025-09-15T15:40:58Z)
Towards provable probabilistic safety for scalable embodied AI systems [79.31011047593492]
Embodied AI systems are increasingly prevalent across various applications.<n> Ensuring their safety in complex operating environments remains a major challenge.<n>This Perspective offers a pathway toward safer, large-scale adoption of embodied AI systems in safety-critical applications.
arXiv Detail & Related papers (2025-06-05T15:46:25Z)
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models [29.569220030102986]
We introduce textbfBeyond Safe Answers (BSA) bench, a novel benchmark comprising 2,000 challenging instances organized into three distinct SSA scenario types.<n> Evaluations of 19 state-of-the-art LRMs demonstrate the difficulty of this benchmark, with top-performing models achieving only 38.0% accuracy in correctly identifying risk rationales.<n>Our work provides a comprehensive assessment tool for evaluating and improving safety reasoning fidelity in LRMs, advancing the development of genuinely risk-aware and reliably safe AI systems.
arXiv Detail & Related papers (2025-05-26T08:49:19Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model [30.774446187857475]
We conduct a safety evaluation of 11 Multimodal Large Reasoning Models (MLRMs) across 5 benchmarks.<n>Our analysis reveals distinct safety patterns across different benchmarks.<n>It is a potential approach to address safety issues in MLRMs by leveraging the intrinsic reasoning capabilities of the model to detect unsafe intent.
arXiv Detail & Related papers (2025-05-10T06:59:36Z)
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models [37.104276926258095]
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data. We introduce textbfDREAM (textittextbfDisentangling textbfRisks to textbfEnhance Safety textbfAlignment in textbfMLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback.
arXiv Detail & Related papers (2025-04-25T03:54:24Z)
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness. Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences. It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters.
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety. For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context. We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z)
Superficial Safety Alignment Hypothesis [8.297367440457508]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction. We identify four types of attribute-critical components in safety-aligned large language models (LLMs) Our findings show that freezing certain safety-critical components 7.5% during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
arXiv Detail & Related papers (2024-10-07T19:53:35Z)
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context. We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z)
Anomalous State Sequence Modeling to Enhance Safety in Reinforcement Learning [0.0]
We propose a safe reinforcement learning (RL) approach that utilizes an anomalous state sequence to enhance RL safety. In experiments on multiple safety-critical environments including self-driving cars, our solution approach successfully learns safer policies.
arXiv Detail & Related papers (2024-07-29T10:30:07Z)
AI Risk Management Should Incorporate Both Safety and Security [185.68738503122114]
We argue that stakeholders in AI risk management should be aware of the nuances, synergies, and interplay between safety and security. We introduce a unified reference framework to clarify the differences and interplay between AI safety and AI security.
arXiv Detail & Related papers (2024-05-29T21:00:47Z)
Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states. In safety-critical domains, such behaviors could lead to disastrous outcomes. We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications. This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark [12.660770759420286]
We present an environment suite called Safety-Gymnasium, which encompasses safety-critical tasks in both single and multi-agent scenarios. We offer a library of algorithms named Safe Policy Optimization (SafePO), comprising 16 state-of-the-art SafeRL algorithms.
arXiv Detail & Related papers (2023-10-19T08:19:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.