Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs
- URL: http://arxiv.org/abs/2601.15698v1
- Date: Thu, 22 Jan 2026 06:56:27 GMT
- Title: Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs
- Authors: Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, Sujuan Qin
- Abstract summary: Beyond Visual Safety (BVS) is a novel image-text pair jailbreaking framework designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging visual splicing and inductive recomposition to decouple malicious intent from raw inputs. Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
- Score: 2.903006172774433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has introduced complex security challenges, particularly at the intersection of textual and visual safety. While existing work has explored the security vulnerabilities of MLLMs, the investigation into their visual safety boundaries remains insufficient. In this paper, we propose Beyond Visual Safety (BVS), a novel image-text pair jailbreaking framework specifically designed to probe the visual safety boundaries of MLLMs. BVS employs a "reconstruction-then-generation" strategy, leveraging neutralized visual splicing and inductive recomposition to decouple malicious intent from raw inputs, thereby inducing MLLMs to generate harmful images. Experimental results demonstrate that BVS achieves a remarkable jailbreak success rate of 98.21% against GPT-5 (12 January 2026 release). Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
Related papers
- MIDAS: Multi-Image Dispersion and Semantic Reconstruction for Jailbreaking MLLMs [22.919956583415324]
We propose Multi-Image Dispersion and Semantic Reconstruction (MIDAS), a multimodal jailbreak framework that decomposes harmful semantics into risk-bearing subunits. MIDAS enforces longer and more structured multi-image chained reasoning.
arXiv Detail & Related papers (2026-02-28T09:29:36Z) - Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography [77.44136793431893]
We propose a novel jailbreak paradigm that introduces dual steganography to covertly embed malicious queries into benign-looking images. Our Odysseus successfully jailbreaks several pioneering and realistic MLLM-integrated systems, achieving up to 99% attack success rate.
arXiv Detail & Related papers (2025-12-23T08:53:36Z) - VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack [40.68344330540352]
Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. Previous jailbreak attacks explore reasoning safety risks in the text modality, while similar threats have been largely overlooked in the visual modality. We propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent.
arXiv Detail & Related papers (2025-12-05T16:29:52Z) - Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling [11.939828002077482]
Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks. We introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs. Our approach achieves an average attack success rate of 83.5%, surpassing the prior state of the art by 46%.
arXiv Detail & Related papers (2025-10-16T18:30:26Z) - Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models [0.0]
Camouflaged jailbreaking embeds malicious intent within seemingly benign language to evade existing safety mechanisms. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods.
arXiv Detail & Related papers (2025-09-05T19:57:38Z) - Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models [83.80177564873094]
We propose a unified multimodal universal jailbreak attack framework. We evaluate the undesirable context generation of MLLMs such as LLaVA, Yi-VL, MiniGPT4, MiniGPT-v2, and InstructBLIP. This study underscores the urgent need for robust safety measures in MLLMs.
arXiv Detail & Related papers (2025-06-02T04:33:56Z) - Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy [31.03584769307822]
We propose JOOD, a new jailbreak framework that OOD-ifies inputs beyond the safety alignment. Experiments across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs.
arXiv Detail & Related papers (2025-03-26T01:25:24Z) - MLLM-as-a-Judge for Image Safety without Human Labeling [81.24707039432292]
In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content. It is crucial to identify such unsafe images based on established safety rules. Existing approaches typically fine-tune MLLMs with human-labeled datasets.
arXiv Detail & Related papers (2024-12-31T00:06:04Z) - Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models [80.77246856082742]
Safety Snowball Agent (SSA) is a novel agent-based framework leveraging agents' autonomous and tool-using abilities to jailbreak LVLMs. Our experiments demonstrate that SSA can use nearly any image to induce LVLMs to produce unsafe content, achieving high jailbreak success rates against the latest LVLMs.
arXiv Detail & Related papers (2024-11-18T11:58:07Z) - Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs). We propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input. Experimental results show that HADES can effectively jailbreak existing MLLMs, achieving an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.