MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
- URL: http://arxiv.org/abs/2406.17806v1
- Date: Sat, 22 Jun 2024 23:26:07 GMT
- Title: MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries?
- Authors: Xirui Li, Hengguang Zhou, Ruochen Wang, Tianyi Zhou, Minhao Cheng, Cho-Jui Hsieh
- Abstract summary: Humans are prone to cognitive distortions -- biased thinking patterns that lead to exaggerated responses to specific stimuli.
This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies.
We identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation.
- Score: 70.77691645678804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans are prone to cognitive distortions -- biased thinking patterns that lead to exaggerated responses to specific stimuli, albeit in very different contexts. This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies. While these models are designed to respond to queries under safety mechanisms, they sometimes reject harmless queries in the presence of certain visual stimuli, disregarding the benign nature of their contexts. As the initial step in investigating this behavior, we identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive Interpretation. To systematically evaluate MLLMs' oversensitivity to these stimuli, we propose the Multimodal OverSenSitivity Benchmark (MOSSBench). This toolkit consists of 300 manually collected benign multimodal queries, cross-verified by third-party reviewers (AMT). Empirical studies using MOSSBench on 20 MLLMs reveal several insights: (1) Oversensitivity is prevalent among SOTA MLLMs, with refusal rates reaching up to 76% for harmless queries. (2) Safer models are more oversensitive: increasing safety may inadvertently raise caution and conservatism in the model's responses. (3) Different types of stimuli tend to cause errors at specific stages -- perception, intent reasoning, and safety judgement -- in the response process of MLLMs. These findings highlight the need for refined safety mechanisms that balance caution with contextually appropriate responses, improving the reliability of MLLMs in real-world applications. We make our project available at https://turningpoint-ai.github.io/MOSSBench/.
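The headline numbers above are per-stimulus refusal rates over benign queries. Below is a minimal sketch of how that aggregation could be wired up; the JSONL layout, the field names (stimulus_type, response), and the keyword-based refusal heuristic are illustrative assumptions, not the paper's actual evaluation pipeline.

```python
# Minimal sketch: per-stimulus refusal-rate scoring for a MOSSBench-style
# oversensitivity benchmark. File layout, field names, and the keyword
# heuristic below are assumptions for illustration only.
import json
from collections import defaultdict

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "i won't")

def is_refusal(response: str) -> bool:
    """Crude surface-level check for a refusal in the model response."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(results_path: str) -> dict[str, float]:
    """Aggregate refusal rate per stimulus type (Exaggerated Risk,
    Negated Harm, Counterintuitive Interpretation)."""
    refused, total = defaultdict(int), defaultdict(int)
    with open(results_path) as f:
        for line in f:  # one JSON record per benign query
            record = json.loads(line)
            stimulus = record["stimulus_type"]
            total[stimulus] += 1
            refused[stimulus] += is_refusal(record["response"])
    return {s: refused[s] / total[s] for s in total}

if __name__ == "__main__":
    for stimulus, rate in refusal_rates("mossbench_results.jsonl").items():
        print(f"{stimulus}: {rate:.1%} refusals on harmless queries")
```

In practice the paper's refusal judgments are more careful than keyword matching; the sketch only shows the shape of the aggregation behind a number like the 76% refusal rate.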
Related papers
- Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights [50.89022445197919]
We propose a speech-specific risk taxonomy, covering 8 risk categories under hostility (malicious sarcasm and threats), malicious imitation (age, gender, ethnicity), and stereotypical biases (age, gender, ethnicity).
Based on the taxonomy, we create a small-scale dataset for evaluating current LMMs' capability in detecting these categories of risk.
arXiv Detail & Related papers (2024-06-25T10:08:45Z)
- SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts -- such as different languages and dialects -- are often overlooked, or only implicitly considered, in many evaluations.
Third, existing evaluations rely on large LLMs as the judge, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs.
This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely to be rejected by LLMs).
We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
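OR-Bench's generation step, as summarized above, amounts to rewriting genuinely toxic seed prompts into benign look-alikes and filtering out rewrites that remain toxic. A minimal sketch of such a rewrite-and-filter loop, where rewrite_fn and is_toxic_fn are hypothetical stand-ins for the LLM rewriter and moderation check, not OR-Bench's actual components:

```python
# Minimal sketch of an OR-Bench-style "seemingly toxic prompt" generator.
# rewrite_fn and is_toxic_fn are hypothetical stand-ins for LLM calls
# (a rewriter and a moderation filter); this is not OR-Bench's actual code.
from typing import Callable, Iterable

def generate_seemingly_toxic(
    toxic_seeds: Iterable[str],
    rewrite_fn: Callable[[str], str],
    is_toxic_fn: Callable[[str], bool],
    per_seed: int = 3,
) -> list[str]:
    """Rewrite genuinely toxic seed prompts into benign look-alikes,
    keeping only rewrites that a moderation check judges safe."""
    kept = []
    for seed in toxic_seeds:
        for _ in range(per_seed):
            # e.g. "rephrase to sound alarming while requesting nothing harmful"
            candidate = rewrite_fn(seed)
            if not is_toxic_fn(candidate):  # drop rewrites that stayed toxic
                kept.append(candidate)
    return kept
```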
- SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [63.946809247201905]
We introduce a new benchmark, namely SHIELD, to evaluate the ability of MLLMs on face spoofing and forgery detection.
We design true/false and multiple-choice questions to evaluate multimodal face data in these two face security tasks.
The results indicate that MLLMs hold substantial potential in the face security domain.
arXiv Detail & Related papers (2024-02-06T17:31:36Z)
- MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models [41.708401515627784]
We observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images.
We introduce MM-SafetyBench, a framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations.
Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits.
arXiv Detail & Related papers (2023-11-29T12:49:45Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments [37.03210795084276]
We examine whether the current format of prompting Large Language Models elicits responses in a consistent and robust manner.
Our experiments on 17 different LLMs reveal that even simple perturbations significantly degrade a model's question-answering ability.
Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions.
arXiv Detail & Related papers (2023-11-16T09:50:53Z)
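The reliability finding above is straightforward to probe: perturb a prompt's surface form, hold its content fixed, and measure how often the answer survives. A minimal sketch, where ask_fn is a hypothetical stand-in for a model call and the perturbations (extra whitespace, shuffled options) are illustrative rather than the paper's exact set:

```python
# Minimal sketch of a perturbation-consistency probe in the spirit of the
# psychometric-reliability paper above. ask_fn is a hypothetical stand-in
# for a model call; the perturbations are illustrative, not the paper's set.
import random
from typing import Callable

def perturb(question: str, options: list[str], rng: random.Random) -> tuple[str, list[str]]:
    """Apply a trivial surface perturbation: extra whitespace plus shuffled options."""
    shuffled = options[:]
    rng.shuffle(shuffled)
    return question + "  ", shuffled  # content unchanged; only form varies

def consistency_rate(
    question: str,
    options: list[str],
    ask_fn: Callable[[str, list[str]], str],  # returns the chosen option text
    trials: int = 10,
    seed: int = 0,
) -> float:
    """Fraction of perturbed trials that agree with the unperturbed answer."""
    rng = random.Random(seed)
    baseline = ask_fn(question, options)
    agree = 0
    for _ in range(trials):
        q, opts = perturb(question, options, rng)
        agree += ask_fn(q, opts) == baseline
    return agree / trials
```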
This list is automatically generated from the titles and abstracts of the papers on this site.