Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
- URL: http://arxiv.org/abs/2505.15389v1
- Date: Wed, 21 May 2025 11:26:40 GMT
- Title: Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
- Authors: DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
- Abstract summary: This study asks: How safe are current vision-language models when confronted with meme images that ordinary users share? We introduce MemeSafetyBench, a benchmark pairing real meme images with both harmful and benign instructions. We find that vision-language models show greater vulnerability to meme-based harmful prompts than to synthetic or typographic images.
- Score: 14.308220140623247
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs show greater vulnerability to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms.
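To make the evaluation protocol described in the abstract more concrete (real meme images paired with harmful or benign instructions, scored for harmful responses and refusals in single- and multi-turn settings), here is a minimal sketch of one way such a harness could be organized. All names (MemeInstance, query_vlm, judge_harmful, judge_refusal) are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a MemeSafetyBench-style evaluation harness.
# MemeInstance, query_vlm, judge_harmful, and judge_refusal are illustrative
# assumptions, not the benchmark's actual interface.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MemeInstance:
    image_path: str          # real meme image shared by ordinary users
    instruction: str         # harmful or benign instruction paired with the meme
    is_harmful: bool         # label derived from the safety taxonomy
    prior_turns: List[str] = field(default_factory=list)  # context for multi-turn runs


def evaluate(
    instances: List[MemeInstance],
    query_vlm: Callable[[str, List[str]], str],
    judge_harmful: Callable[[str], bool],
    judge_refusal: Callable[[str], bool],
) -> Dict[str, Dict[str, float]]:
    """Average harmful-response and refusal rates, split by instruction type."""
    stats = {key: {"harmful_rate": 0.0, "refusal_rate": 0.0, "n": 0}
             for key in ("harmful", "benign")}
    for inst in instances:
        key = "harmful" if inst.is_harmful else "benign"
        response = query_vlm(inst.image_path, inst.prior_turns + [inst.instruction])
        stats[key]["n"] += 1
        stats[key]["harmful_rate"] += float(judge_harmful(response))
        stats[key]["refusal_rate"] += float(judge_refusal(response))
    for s in stats.values():
        if s["n"]:
            s["harmful_rate"] /= s["n"]
            s["refusal_rate"] /= s["n"]
    return stats
```

Splitting the tallies by harmful versus benign instructions reflects the benchmark's pairing of both instruction types with each meme, presumably so that increased safety does not simply come from refusing benign requests as well.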
Related papers
- Detecting Harmful Memes with Decoupled Understanding and Guided CoT Reasoning [26.546646866501735]
We introduce U-CoT+, a novel framework for harmful meme detection. We first develop a high-fidelity meme-to-text pipeline that converts visual memes into detail-preserving textual descriptions. This design decouples meme interpretation from meme classification, thus avoiding immediate reasoning over complex raw visual content.
arXiv Detail & Related papers (2025-06-10T06:10:45Z)
- HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model [52.72318433518926]
Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content. We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We propose SafeLLaVA, a novel VLM augmented with a learnable safety meta token and a dedicated safety head.
arXiv Detail & Related papers (2025-06-05T07:26:34Z)
- Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs [51.90597846977058]
Video-SafetyBench is the first benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images and motion text.
arXiv Detail & Related papers (2025-05-17T05:06:38Z)
- Transferable Adversarial Attacks on Black-Box Vision-Language Models [63.22532779621001]
Adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts. We show that attackers can craft perturbations to induce specific attacker-chosen interpretations of visual information. We discover that universal perturbations (modifications applicable to a wide set of images) can consistently induce these misinterpretations.
arXiv Detail & Related papers (2025-05-02T06:51:11Z)
- MemeBLIP2: A novel lightweight multimodal system to detect harmful memes [10.174106475035689]
We introduce MemeBLIP2, a lightweight multimodal system that detects harmful memes by effectively combining image and text features. We build on previous studies by adding modules that align image and text representations into a shared space and fuse them for better classification. The results show that MemeBLIP2 can capture subtle cues in both modalities, even in cases with ironic or culturally specific content.
arXiv Detail & Related papers (2025-04-29T23:41:06Z)
- Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs [56.440345471966666]
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. This paper introduces MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios. MMSafeAware includes both unsafe and over-safety subsets to assess models' abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness.
arXiv Detail & Related papers (2025-02-16T16:12:40Z)
- MLLM-as-a-Judge for Image Safety without Human Labeling [81.24707039432292]
In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content. It is crucial to identify such unsafe images based on established safety rules. Existing approaches typically fine-tune MLLMs with human-labeled datasets.
arXiv Detail & Related papers (2024-12-31T00:06:04Z)
- VLSBench: Unveiling Visual Leakage in Multimodal Safety [39.344623032631475]
Safety concerns about multimodal large language models (MLLMs) have gradually become an important problem in various applications. Previous works indicate a counterintuitive phenomenon: using textual unlearning to align MLLMs achieves safety performance comparable to MLLMs aligned with image-text pairs.
arXiv Detail & Related papers (2024-11-29T18:56:37Z)
- CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
Multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs.
The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs.
We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
- GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse [14.571295331012331]
We introduce GOAT-Bench, a comprehensive meme benchmark comprising over 6K varied memes encapsulating themes such as implicit hate speech, cyberbullying, and sexism. We delve into the ability of LMMs to accurately assess hatefulness, misogyny, offensiveness, sarcasm, and harmful content. Our extensive experiments across a range of LMMs reveal that current models still exhibit a deficiency in safety awareness, showing insensitivity to various forms of implicit abuse.
arXiv Detail & Related papers (2024-01-03T03:28:55Z)
- Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models [17.617187709968242]
Existing harmful meme detection approaches only recognize superficial harm-indicative signals in an end-to-end classification manner.
We propose a novel generative framework to learn reasonable thoughts from Large Language Models for better multimodal fusion.
Our proposed approach outperforms state-of-the-art methods on the harmful meme detection task.
arXiv Detail & Related papers (2023-12-09T01:59:11Z)