Related papers: When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

URL: http://arxiv.org/abs/2602.10179v1
Date: Tue, 10 Feb 2026 18:59:55 GMT
Title: When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
Authors: Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang,
Abstract summary: We propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack.<n>VJA conveys malicious instructions purely through visual inputs.<n>We propose a training-free defense based on introspective multimodal reasoning.
Score: 19.655310421085435
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.

Related papers

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models [57.128876964730644]
Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability.<n>We propose Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image.<n>VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.
arXiv Detail & Related papers (2026-02-24T15:20:01Z)
Robustness of Vision Language Models Against Split-Image Harmful Input Attacks [4.937150501683971]
Vision-Language Models (VLMs) are now a core part of modern AI.<n>Recent work proposed several visual jailbreak attacks using single/ holistic images.<n>We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment.
arXiv Detail & Related papers (2026-02-08T21:52:42Z)
Jailbreaks on Vision Language Model via Multimodal Reasoning [10.066621451320792]
We present a framework that exploits post-training Chain-of-Thought prompting to construct stealthy prompts capable of bypassing safety filters.<n>We also propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback.
arXiv Detail & Related papers (2026-01-29T23:09:24Z)
SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content.<n>We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality.<n>SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z)
Robustifying Vision-Language Models via Dynamic Token Reweighting [28.675118345987887]
Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks.<n>We present a novel inference-time defense that mitigates multimodal jailbreak attacks.<n>We introduce a new formulation of the safety-relevant distributional shift induced by the visual modality.
arXiv Detail & Related papers (2025-05-22T03:00:39Z)
Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs.<n>We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z)
SC-Pro: Training-Free Framework for Defending Unsafe Image Synthesis Attack [13.799517170191919]
Recent research has shown that safety checkers have vulnerabilities against adversarial attacks, allowing them to generate Not Safe For Work (NSFW) images.<n>We propose SC-Pro, a training-free framework that easily defends against adversarial attacks generating NSFW images.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models [33.29825481203704]
AdvI2I is a novel framework that manipulates input images to induce diffusion models to generate NSFW content.<n>By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms.<n>We show that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards.
arXiv Detail & Related papers (2024-10-28T19:15:06Z)
Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step [65.1882845496516]
We introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process.<n>Our CoJ attack method can successfully bypass the safeguards of models for over 60% cases.<n>We also propose an effective prompting-based method, Think Twice Prompting, that can successfully defend over 95% of CoJ attack.
arXiv Detail & Related papers (2024-10-04T19:04:43Z)
HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models [28.28898114141277]
Text-to-Image(T2I) models have achieved remarkable success in image generation and editing.<n>These models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work(NSFW) content.<n>We propose HTS-Attack, a token search attack method.
arXiv Detail & Related papers (2024-08-25T17:33:40Z)
Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
Adversarial Prompt Tuning for Vision-Language Models [86.5543597406173]
Adversarial Prompt Tuning (AdvPT) is a technique to enhance the adversarial robustness of image encoders in Vision-Language Models (VLMs) We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques.
arXiv Detail & Related papers (2023-11-19T07:47:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.