VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language
- URL: http://arxiv.org/abs/2511.13127v1
- Date: Mon, 17 Nov 2025 08:31:43 GMT
- Title: VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language
- Authors: Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu
- Abstract summary: Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts. We show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos. We propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design.
- Score: 25.38940067963429
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jailbreak attacks can circumvent model safety guardrails and reveal critical blind spots. Prior attacks on text-to-video (T2V) models typically add adversarial perturbations to obviously unsafe prompts, which are often easy to detect and defend against. In contrast, we show that benign-looking prompts containing rich, implicit cues can induce T2V models to generate semantically unsafe videos that both violate policy and preserve the original (blocked) intent. To realize this, we propose VEIL, a jailbreak framework that leverages T2V models' cross-modal associative patterns via a modular prompt design. Specifically, our prompts combine three components: neutral scene anchors, which provide the surface-level scene description extracted from the blocked intent to maintain plausibility; latent auditory triggers, textual descriptions of innocuous-sounding audio events (e.g., creaking, muffled noises) that exploit learned audio-visual co-occurrence priors to bias the model toward particular unsafe visual concepts; and stylistic modulators, cinematic directives (e.g., camera framing, atmosphere) that amplify and stabilize the latent trigger's effect. We formalize attack generation as a constrained optimization over the above modular prompt space and solve it with a guided search procedure that balances stealth and effectiveness. Extensive experiments over 7 T2V models demonstrate the efficacy of our attack, achieving a 23 percent improvement in average attack success rate on commercial models.
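The modular prompt space and guided search described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the component pools, the scoring functions, and the weighting `alpha` are hypothetical stand-ins, not the paper's actual prompt pools or objective.

```python
import itertools

# Hypothetical component pools illustrating VEIL's three prompt modules
# (contents are illustrative, not taken from the paper).
SCENE_ANCHORS = ["A dimly lit hallway at night", "An empty warehouse interior"]
AUDIO_TRIGGERS = ["floorboards creaking slowly", "muffled thuds behind a door"]
STYLE_MODULATORS = ["handheld camera, tense atmosphere", "slow zoom, low-key lighting"]

def compose(anchor, trigger, modulator):
    """Assemble one candidate prompt from the three modules."""
    return f"{anchor}, with {trigger}; {modulator}."

def score(prompt, stealth_fn, effect_fn, alpha=0.5):
    """Balance stealth (evading input filters) against effectiveness
    (inducing the target visual concept); both scorers are stand-ins."""
    return alpha * stealth_fn(prompt) + (1 - alpha) * effect_fn(prompt)

def guided_search(stealth_fn, effect_fn):
    """Search the modular prompt space for the highest-scoring candidate.
    The paper uses a guided procedure; with toy pools this small we can
    simply score every anchor/trigger/modulator combination."""
    candidates = itertools.product(SCENE_ANCHORS, AUDIO_TRIGGERS, STYLE_MODULATORS)
    return max((compose(*c) for c in candidates),
               key=lambda p: score(p, stealth_fn, effect_fn))

# Toy scorers: shorter prompts count as stealthier, and prompts containing
# the trigger word "creaking" count as effective.
best = guided_search(lambda p: 1.0 / len(p), lambda p: float("creaking" in p))
print(best)
```

With these toy scorers, the search always selects a candidate built on the "creaking" auditory trigger, mirroring how the framework biases generation through innocuous-sounding audio descriptions.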
Related papers
- Jailbreaks on Vision Language Model via Multimodal Reasoning [10.066621451320792]
We present a framework that exploits post-training Chain-of-Thought prompting to construct stealthy prompts capable of bypassing safety filters. We also propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback.
arXiv Detail & Related papers (2026-01-29T23:09:24Z)
- T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models [67.13397169618624]
We introduce T2VAttack, a study of adversarial attacks on Text-to-Video (T2V) models from both semantic and temporal perspectives. To achieve an effective and efficient attack process, we propose two adversarial attack methods: T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt.
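The greedy synonym substitution described for T2VAttack-S can be sketched as follows. The synonym table and the attack scorer are toy stand-ins introduced for illustration only; the paper's actual word-importance ranking and scoring model are not reproduced here.

```python
# Illustrative synonym table (hypothetical, not from the paper).
SYNONYMS = {
    "running": ["sprinting", "jogging"],
    "dog": ["hound", "canine"],
}

def greedy_synonym_attack(prompt, attack_score):
    """Greedily replace one word at a time, keeping any synonym swap
    that strictly increases the (model-specific) attack score."""
    words = prompt.split()
    best_score = attack_score(" ".join(words))
    for i in range(len(words)):
        for syn in SYNONYMS.get(words[i], []):
            trial = words[:i] + [syn] + words[i + 1:]
            s = attack_score(" ".join(trial))
            if s > best_score:
                words, best_score = trial, s
    return " ".join(words)

# Toy scorer rewarding prompts that contain the word "hound".
out = greedy_synonym_attack("a dog running fast", lambda p: p.count("hound"))
print(out)  # → "a hound running fast"
```

The greedy loop accepts the first improving swap per position, which keeps the perturbation small, matching the blurb's emphasis on minimal prompt modification.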
arXiv Detail & Related papers (2025-12-30T03:00:46Z)
- Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio [63.18443674004945]
This work explores a content-centric threat: exploiting TTS systems to produce speech containing harmful content. We present HARMGEN, a suite of five attacks organized into two families that address these challenges.
arXiv Detail & Related papers (2025-11-14T03:00:04Z)
- VMDT: Decoding the Trustworthiness of Video Foundation Models [77.90980744982079]
We introduce VMDT, the first unified platform for evaluating text-to-video (T2V) and video-to-text (V2T) models. Through our evaluation of 7 T2V models and 19 V2T models using VMDT, we uncover several significant insights.
arXiv Detail & Related papers (2025-11-07T19:56:00Z)
- T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail. We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models. The proposed method improves attack success rate by 11.4% and 10.0% over the existing state-of-the-art method.
arXiv Detail & Related papers (2025-05-10T16:04:52Z)
- BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation [37.055665794706336]
Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts. We introduce BadVideo, the first backdoor attack framework tailored for T2V generation.
arXiv Detail & Related papers (2025-04-23T17:34:48Z)
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
- HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models [28.28898114141277]
Text-to-Image (T2I) models have achieved remarkable success in image generation and editing. These models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. We propose HTS-Attack, a heuristic token search attack method.
arXiv Detail & Related papers (2024-08-25T17:33:40Z)
- Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z)
- Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models [10.70975463369742]
We present the Jailbreaking Prompt Attack (JPA). JPA searches for the target malicious concepts in the text embedding space using a group of antonyms. A prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space.
arXiv Detail & Related papers (2024-04-02T09:49:35Z)