Related papers: Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition

Related papers

Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making [12.400383981686801]
One mistake by an AI system in a safety-critical setting can cost lives.<n>As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows.<n>This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic.
arXiv Detail & Related papers (2026-01-09T05:04:15Z)
VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety [3.1109025622085693]
We present Vision Language Safety Understanding, a comprehensive framework to evaluate multimodal safety.<n>Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures.<n>Our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models.
arXiv Detail & Related papers (2025-10-21T01:30:31Z)
DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution.<n>No existing benchmark has systematically addressed over-refusal in the visual modality.<n>This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z)
AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond [101.20320617562321]
AccidentBench is a large-scale benchmark that combines vehicle accident scenarios with Beyond domains.<n>The benchmark contains approximately 2000 videos and over 19000 human-annotated question-answer pairs.
arXiv Detail & Related papers (2025-09-30T17:59:13Z)
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare.<n>Red-teaming framework that continuously stress-test LLMs can reveal significant weaknesses in four safety-critical domains.<n>A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses.<n>Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation [69.63626052852153]
We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems.<n>We also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts.
arXiv Detail & Related papers (2025-06-26T02:28:58Z)
Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models [7.916129615051081]
We introduce a dataset comprising over 34,000 synthetic images generated by diffusion models.<n>The dataset includes 214 human-annotated images that serve as a gold-standard reference for validation.
arXiv Detail & Related papers (2025-06-25T07:06:29Z)
HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content.<n>We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations.<n>We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z)
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models [73.6716695218951]
Over-refusal is a phenomenon known as $textitover-refusal$ that reduces the practical utility of T2I models.<n>We present OVERT ($textbfOVE$r-$textbfR$efusal evaluation on $textbfT$ext-to-image models), the first large-scale benchmark for assessing over-refusal behaviors.
arXiv Detail & Related papers (2025-05-27T15:42:46Z)
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration [44.74741064549195]
We introduce the concept of $textitsafety calibration, which systematically addresses both undersafety and oversafety.<n>We present a novel dataset of 3,600 image-text pairs that are visually or textually similar but differ in terms of safety.<n>Based on our benchmark, we evaluate safety calibration across eleven widely used vision-language models.
arXiv Detail & Related papers (2025-05-26T09:01:46Z)
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning [12.467239356591238]
FalseReject is a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories.<n>We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts.<n>We show that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.
arXiv Detail & Related papers (2025-05-12T20:45:25Z)
Transferable Adversarial Attacks on Black-Box Vision-Language Models [63.22532779621001]
adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts.<n>We show that attackers can craft perturbations to induce specific attacker-chosen interpretations of visual information.<n>We discover that universal perturbations -- modifications applicable to a wide set of images -- can consistently induce these misinterpretations.
arXiv Detail & Related papers (2025-05-02T06:51:11Z)
Context-Awareness and Interpretability of Rare Occurrences for Discovery and Formalization of Critical Failure Modes [3.140125449151061]
Vision systems are increasingly deployed in critical domains such as surveillance, law enforcement, and transportation.<n>To address these challenges, we introduce Context-Awareness and Interpretability of Rare Occurrences (CAIRO)<n>CAIRO incentivizes human-in-the-loop for testing and evaluation of criticality that arises from misdetections, adversarial attacks, and hallucinations in AI black-box models.
arXiv Detail & Related papers (2025-04-18T17:12:37Z)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.<n>Certain scenarios suffer 25 times higher attack rates.<n>Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the textbfREliability and textbfVALue of Large Vision-Language Models.<n>REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values.<n>We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z)
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs [56.440345471966666]
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images.<n>This paper introduces MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios.<n> MMSafeAware includes both unsafe and over-safety subsets to assess models abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness.
arXiv Detail & Related papers (2025-02-16T16:12:40Z)
Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance.<n>Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
arXiv Detail & Related papers (2025-01-30T17:59:45Z)
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522]
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues.<n>We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment.<n>Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
arXiv Detail & Related papers (2024-11-27T02:40:29Z)
Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces [5.993182776695029]
We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance. The framework comprises four main modules: scene recognition, the visual prompt, safety items detection, and fine-grained verification. The results show that Clip2Safety not only demonstrates an accuracy improvement over state-of-the-art question-answering based VLMs but also achieves inference times two hundred times faster.
arXiv Detail & Related papers (2024-08-13T18:32:06Z)
High-Dimensional Fault Tolerance Testing of Highly Automated Vehicles Based on Low-Rank Models [39.139025989575686]
Fault Injection (FI) testing is conducted to evaluate the safety level of HAVs. To fully cover test cases, various driving scenarios and fault settings should be considered. We propose to accelerate FI testing under the low-rank Smoothness Regularized Matrix Factorization framework.
arXiv Detail & Related papers (2024-07-28T14:27:13Z)
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
Building Safe and Reliable AI systems for Safety Critical Tasks with Vision-Language Processing [1.2183405753834557]
Current AI algorithms are unable to identify common causes for failure detection. Additional techniques are required to quantify the quality of predictions. This thesis will focus on vision-language data processing for tasks like classification, image captioning, and vision question answering.
arXiv Detail & Related papers (2023-08-06T18:05:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.