When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
- URL: http://arxiv.org/abs/2602.10179v1
- Date: Tue, 10 Feb 2026 18:59:55 GMT
- Title: When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
- Authors: Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang,
- Abstract summary: We propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack.<n>VJA conveys malicious instructions purely through visual inputs.<n>We propose a training-free defense based on introspective multimodal reasoning.
- Score: 19.655310421085435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
Related papers
- VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models [57.128876964730644]
Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability.<n>We propose Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image.<n>VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.
arXiv Detail & Related papers (2026-02-24T15:20:01Z) - Robustness of Vision Language Models Against Split-Image Harmful Input Attacks [4.937150501683971]
Vision-Language Models (VLMs) are now a core part of modern AI.<n>Recent work proposed several visual jailbreak attacks using single/ holistic images.<n>We introduce novel split-image visual jailbreak attacks (SIVA) that exploit this misalignment.
arXiv Detail & Related papers (2026-02-08T21:52:42Z) - Jailbreaks on Vision Language Model via Multimodal Reasoning [10.066621451320792]
We present a framework that exploits post-training Chain-of-Thought prompting to construct stealthy prompts capable of bypassing safety filters.<n>We also propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback.
arXiv Detail & Related papers (2026-01-29T23:09:24Z) - SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content.<n>We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality.<n>SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z) - Robustifying Vision-Language Models via Dynamic Token Reweighting [28.675118345987887]
Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks.<n>We present a novel inference-time defense that mitigates multimodal jailbreak attacks.<n>We introduce a new formulation of the safety-relevant distributional shift induced by the visual modality.
arXiv Detail & Related papers (2025-05-22T03:00:39Z) - Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs.<n>We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z) - SC-Pro: Training-Free Framework for Defending Unsafe Image Synthesis Attack [13.799517170191919]
Recent research has shown that safety checkers have vulnerabilities against adversarial attacks, allowing them to generate Not Safe For Work (NSFW) images.<n>We propose SC-Pro, a training-free framework that easily defends against adversarial attacks generating NSFW images.
arXiv Detail & Related papers (2025-01-09T16:43:21Z) - AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion models [33.29825481203704]
AdvI2I is a novel framework that manipulates input images to induce diffusion models to generate NSFW content.<n>By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms.<n>We show that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards.
arXiv Detail & Related papers (2024-10-28T19:15:06Z) - Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step [65.1882845496516]
We introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process.<n>Our CoJ attack method can successfully bypass the safeguards of models for over 60% cases.<n>We also propose an effective prompting-based method, Think Twice Prompting, that can successfully defend over 95% of CoJ attack.
arXiv Detail & Related papers (2024-10-04T19:04:43Z) - HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models [28.28898114141277]
Text-to-Image(T2I) models have achieved remarkable success in image generation and editing.<n>These models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work(NSFW) content.<n>We propose HTS-Attack, a token search attack method.
arXiv Detail & Related papers (2024-08-25T17:33:40Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - Adversarial Prompt Tuning for Vision-Language Models [86.5543597406173]
Adversarial Prompt Tuning (AdvPT) is a technique to enhance the adversarial robustness of image encoders in Vision-Language Models (VLMs)
We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques.
arXiv Detail & Related papers (2023-11-19T07:47:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.