VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
- URL: http://arxiv.org/abs/2602.20999v2
- Date: Sun, 01 Mar 2026 18:32:59 GMT
- Title: VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
- Authors: Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You
- Abstract summary: Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability. We propose Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.
- Score: 57.128876964730644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.
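The abstract describes a clean two-stage pipeline, which the minimal sketch below illustrates. Everything in it is an assumed reconstruction from the abstract alone: the helper names, the LLM rewriting call, and the PIL text-overlay rendering are placeholders, not the authors' implementation, and the actual reprogramming prompts are not given in this listing.

```python
# Illustrative sketch of the two VII modules named in the abstract.
# All helper names are hypothetical; a PIL text overlay stands in for
# whatever rendering the Visual Instruction Grounding module performs.

from PIL import Image, ImageDraw, ImageFont

def reprogram_intent(unsafe_prompt: str, rewrite_llm) -> str:
    """Malicious Intent Reprogramming: distill the intent of the unsafe
    prompt into instruction text whose static (single-image) harmfulness
    is minimized. The rewriting model and prompt are unspecified here."""
    return rewrite_llm.rewrite(unsafe_prompt)  # hypothetical call

def ground_instruction(safe_image: Image.Image, instruction: str) -> Image.Image:
    """Visual Instruction Grounding: render the distilled instruction onto
    a safe reference image, preserving semantic consistency with the
    original prompt, so it acts as an implicit control signal for I2V."""
    out = safe_image.copy()
    ImageDraw.Draw(out).text((12, 12), instruction, fill="white",
                             font=ImageFont.load_default())
    return out
```

Because the attack is training-free, a sketch like this needs no optimization loop; the modified image would simply be submitted to the target I2V model, and the abstract credits this image-modality injection for the transfer across the four commercial models tested.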
Related papers
- When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models [19.655310421085435]
We propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack. VJA conveys malicious instructions purely through visual inputs. We propose a training-free defense based on introspective multimodal reasoning.
arXiv Detail & Related papers (2026-02-10T18:59:55Z)
- VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language [25.38940067963429]
Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts. We show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos. We propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design.
arXiv Detail & Related papers (2025-11-17T08:31:43Z)
- VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands [5.1114671756882535]
This work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. We prepend harmful corpora with affirmative prefixes to trick the model into responding positively to malicious queries. Our results demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised.
arXiv Detail & Related papers (2025-10-09T16:18:31Z)
- Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing [2.48490797934472]
We introduce Vid-Freeze, a novel attention-suppressing adversarial attack that adds carefully crafted adversarial perturbations to images. Our method explicitly targets the attention mechanism of I2V models, completely disrupting motion synthesis. The resulting immunized images generate stand-still or near-static videos, effectively blocking malicious content creation.
arXiv Detail & Related papers (2025-09-27T12:26:34Z)
- VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation [57.36681904639463]
Methods to safeguard autoregressive text-to-image models remain underexplored. We propose Visual Contrast Exploitation (VCE), a novel framework that precisely decouples unsafe concepts from their associated content semantics. Our experiments demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts.
arXiv Detail & Related papers (2025-09-21T09:00:27Z)
- ShieldGemma 2: Robust and Tractable Image Content Moderation [63.36923375135708]
ShieldGemma 2 is a 4B-parameter image content moderation model built on Gemma 3. It provides robust safety risk predictions on synthetic images across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content.
arXiv Detail & Related papers (2025-04-01T18:00:20Z)
- Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding [16.188657772178747]
We propose Embedding Sanitizer (ES), which enhances the safety of text-to-image models by sanitizing inappropriate concepts in prompt embeddings. ES is the first interpretable safe generation framework that assigns a score to each token in the prompt to indicate its potential harmfulness.
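Since this listing gives only the per-token scoring idea, the sketch below is an assumed instantiation: cosine similarity against a bank of unsafe-concept embeddings as the harmfulness score, with flagged tokens attenuated. The threshold and attenuation rule are illustrative choices, not ES's actual design.

```python
import torch
import torch.nn.functional as F

def sanitize_prompt_embedding(token_embs: torch.Tensor,
                              concept_embs: torch.Tensor,
                              threshold: float = 0.35) -> torch.Tensor:
    """Assumed sanitizer: score each token embedding (shape [T, D]) by its
    maximum cosine similarity to unsafe-concept embeddings ([C, D]) and
    dampen any token scoring above the threshold."""
    tok = F.normalize(token_embs, dim=-1)
    con = F.normalize(concept_embs, dim=-1)
    scores = (tok @ con.T).amax(dim=-1)          # [T], per-token harmfulness
    scale = torch.where(scores > threshold,
                        1.0 - scores,            # shrink flagged tokens
                        torch.ones_like(scores)) # pass benign tokens through
    return token_embs * scale.unsqueeze(-1)
```

The per-token scores are what makes such a scheme interpretable: they localize which part of the prompt triggered the intervention.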
arXiv Detail & Related papers (2024-11-15T16:29:02Z)
- Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step [65.1882845496516]
We introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Our CoJ attack method successfully bypasses the safeguards of models in over 60% of cases. We also propose an effective prompting-based method, Think Twice Prompting, which successfully defends against over 95% of CoJ attacks.
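The listing names the defense but not its prompts, so the guard below is a hypothetical reconstruction of the idea: before executing the next edit, make the model reason about the cumulative result of the whole edit chain, so harm split across innocuous steps is judged in aggregate. The `llm.ask` interface and prompt wording are assumptions.

```python
def think_twice_guard(pending_edit: str, edit_history: list[str], llm) -> bool:
    """Hypothetical guard in the spirit of Think Twice Prompting: ask the
    model to imagine the image produced by the full edit chain, then give
    a safety verdict, rather than judging each edit in isolation."""
    chain = " -> ".join(edit_history + [pending_edit])
    verdict = llm.ask(  # hypothetical chat-completion call
        "Describe the image that would result from applying these edits "
        f"in order: {chain}. Then answer exactly SAFE or UNSAFE."
    )
    return "UNSAFE" not in verdict.upper()  # allow the edit only if SAFE
```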
arXiv Detail & Related papers (2024-10-04T19:04:43Z)
- TrojVLM: Backdoor Attack Against Vision Language Models [50.87239635292717]
This study introduces TrojVLM, the first exploration of backdoor attacks aimed at Vision Language Models (VLMs).
TrojVLM inserts predetermined target text into the output text when encountering poisoned images.
A novel semantic-preserving loss is proposed to ensure the semantic integrity of the original image content.
arXiv Detail & Related papers (2024-09-28T04:37:09Z)
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
- SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content.
We present SafeGen, a framework to mitigate sexual content generation by text-to-image models.
arXiv Detail & Related papers (2024-04-10T00:26:08Z)