Related papers: VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

URL: http://arxiv.org/abs/2510.17759v1
Date: Mon, 20 Oct 2025 17:12:10 GMT
Title: VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
Authors: Qilin Liao, Anamika Lochab, Ruqi Zhang,
Abstract summary: We introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts.<n>We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks.<n>Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines.
Score: 19.867040067010674
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.

Related papers

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models [75.16145284285456]
We introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings.<n>We develop the first automatically crafted and semantically guided prompting framework.<n> Experiments on the LIBERO benchmark reveal that even minor multimodal perturbations can cause significant behavioral deviations.
arXiv Detail & Related papers (2025-11-20T10:14:32Z)
Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling [11.939828002077482]
Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks.<n>We introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs.<n>Our approach achieves an average attack success rate of 83.5%, surpassing prior state-of-the-art by 46%.
arXiv Detail & Related papers (2025-10-16T18:30:26Z)
Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models [20.99874786089634]
Previous jailbreak attacks often inject malicious instructions from text into less aligned modalities, such as vision.<n>We propose a novel implicit jailbreak framework termed IJA that stealthily embeds malicious instructions into images via at least significant bit steganography.<n>On commercial models like GPT-4o and Gemini-1.5 Pro, our method achieves attack success rates of over 90% using an average of only 3 queries.
arXiv Detail & Related papers (2025-05-22T09:34:47Z)
Robustifying Vision-Language Models via Dynamic Token Reweighting [28.675118345987887]
Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks.<n>We present a novel inference-time defense that mitigates multimodal jailbreak attacks.<n>We introduce a new formulation of the safety-relevant distributional shift induced by the visual modality.
arXiv Detail & Related papers (2025-05-22T03:00:39Z)
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.<n>We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats.<n>Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Distraction is All You Need for Multimodal Large Language Model Jailbreaking [14.787247403225294]
We propose Contrasting Subimage Distraction Jailbreaking (CS-DJ) to disrupt MLLMs alignment through multi-level distraction strategies.<n>CS-DJ achieves average success rates of 52.40% for the attack success rate and 74.10% for the ensemble attack success rate.<n>These results reveal the potential of distraction-based approaches to exploit and bypass MLLMs' defenses.
arXiv Detail & Related papers (2025-02-15T13:25:12Z)
White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input. An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models [65.23688155159398]
Autoregressive Visual Language Models (VLMs) showcase impressive few-shot learning capabilities in a multimodal context. Recently, multimodal instruction tuning has been proposed to further enhance instruction-following abilities. Adversaries can implant a backdoor by injecting poisoned samples with triggers embedded in instructions or images. We propose a multimodal instruction backdoor attack, namely VL-Trojan.
arXiv Detail & Related papers (2024-02-21T14:54:30Z)
Adversarial Prompt Tuning for Vision-Language Models [86.5543597406173]
Adversarial Prompt Tuning (AdvPT) is a technique to enhance the adversarial robustness of image encoders in Vision-Language Models (VLMs) We demonstrate that AdvPT improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques.
arXiv Detail & Related papers (2023-11-19T07:47:43Z)
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models [46.14455492739906]
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. Existing approaches mainly focus on exploring the adversarial robustness under the white-box setting. We propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels.
arXiv Detail & Related papers (2023-10-07T02:18:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.