DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models
- URL: http://arxiv.org/abs/2510.10846v1
- Date: Sun, 12 Oct 2025 23:21:34 GMT
- Title: DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models
- Authors: Kaixuan Ren, Preslav Nakov, Usman Naseem
- Abstract summary: Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution. No existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
- Score: 59.45605332033458
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As vision-language models become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet, no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with a focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.
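
The abstract frames evaluation around three response outcomes (safe completion, over-refusal, unsafe completion) and reports per-model safe-completion rates under semantics-preserving perturbations. The sketch below is a minimal, hypothetical illustration of how such rates could be aggregated; the record format, label names, and judging criteria are assumptions, not DUAL-Bench's actual pipeline.

```python
from collections import defaultdict

# Hypothetical outcome labels; DUAL-Bench's exact judging rubric is defined in the paper.
SAFE_COMPLETION = "safe_completion"      # benign parts fulfilled, harmful elements flagged
OVER_REFUSAL = "over_refusal"            # benign request declined out of caution
UNSAFE_COMPLETION = "unsafe_completion"  # task completed without addressing harmful content

def safe_completion_rates(records):
    """Aggregate safe-completion rates per (model, hazard, perturbation).

    Each record is assumed to be a dict such as:
      {"model": "gpt-5-nano", "hazard": "weapons",
       "perturbation": "rotation", "outcome": SAFE_COMPLETION}
    """
    totals = defaultdict(int)
    safe = defaultdict(int)
    for r in records:
        key = (r["model"], r["hazard"], r["perturbation"])
        totals[key] += 1
        if r["outcome"] == SAFE_COMPLETION:
            safe[key] += 1
    return {key: safe[key] / totals[key] for key in totals}
```

Comparing the resulting rates across perturbation variants of the same image-instruction pair would give a rough proxy for the robustness dimension the benchmark targets.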
Related papers
- Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context [82.32380418146656]
Health-ORSC-Bench is the first large-scale benchmark designed to measure Over-Refusal and Safe Completion quality in healthcare. Our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants.
arXiv Detail & Related papers (2026-01-25T01:28:52Z) - A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 [101.4233736714284]
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have driven major gains in reasoning, perception, and generation across language and vision. We present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image generation.
arXiv Detail & Related papers (2026-01-15T15:52:52Z) - SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z) - HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model [58.12612140992874]
We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We also propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.
arXiv Detail & Related papers (2025-06-05T07:26:34Z) - From Evaluation to Defense: Advancing Safety in Video Large Language Models [33.10355085086974]
We introduce VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse benchmark for Video LLM safety. Integrating the video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. We propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through two innovations.
arXiv Detail & Related papers (2025-05-22T13:16:53Z) - The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 [70.94607997570729]
We present a comprehensive safety assessment of the OpenAI-o3 and DeepSeek-R1 reasoning models. We investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications.
arXiv Detail & Related papers (2025-02-18T09:06:07Z) - Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Our experiments demonstrate that InternVL2.5-8B fine-tuned with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
arXiv Detail & Related papers (2025-01-30T17:59:45Z)