Related papers: Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

URL: http://arxiv.org/abs/2508.01741v1
Date: Sun, 03 Aug 2025 12:51:47 GMT
Title: Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models
Authors: Ruofan Wang, Xin Wang, Yang Yao, Xuan Tong, Xingjun Ma
Abstract summary: Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface. We introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method. SEA exploits inherited vulnerabilities from the base model, significantly enhancing transferability.
Score: 24.65236224895181
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-tuning open-source Vision-Language Models (VLMs) creates a critical yet underexplored attack surface: vulnerabilities in the base VLM could be retained in fine-tuned variants, rendering them susceptible to transferable jailbreak attacks. To demonstrate this risk, we introduce the Simulated Ensemble Attack (SEA), a novel grey-box jailbreak method in which the adversary has full access to the base VLM but no knowledge of the fine-tuned target's weights or training configuration. To improve jailbreak transferability across fine-tuned VLMs, SEA combines two key techniques: Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG). FTS generates transferable adversarial images by simulating the vision encoder's parameter shifts, while TPG is a textual strategy that steers the language decoder toward adversarially optimized outputs. Experiments on the Qwen2-VL family (2B and 7B) demonstrate that SEA achieves high transfer attack success rates exceeding 86.5% and toxicity rates near 49.5% across diverse fine-tuned variants, even those specifically fine-tuned to improve safety behaviors. Notably, while direct PGD-based image jailbreaks rarely transfer across fine-tuned VLMs, SEA reliably exploits inherited vulnerabilities from the base model, significantly enhancing transferability. These findings highlight an urgent need to safeguard fine-tuned proprietary VLMs against transferable vulnerabilities inherited from open-source foundations, motivating the development of holistic defenses across the entire model lifecycle.

Related papers

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses [4.706534644850809]
Two primary inference-phase threats are token-level and prompt-level jailbreaks. We propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs.
arXiv Detail & Related papers (2025-06-27T07:26:33Z)
Robustifying Vision-Language Models via Dynamic Token Reweighting [28.675118345987887]
Large vision-language models (VLMs) are highly vulnerable to jailbreak attacks. We present a novel inference-time defense that mitigates multimodal jailbreak attacks. We introduce a new formulation of the safety-relevant distributional shift induced by the visual modality.
arXiv Detail & Related papers (2025-05-22T03:00:39Z)
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks [67.91652526657599]
We formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail. We conduct large-scale experiments on several T2V models, covering both open-source models and real commercial closed-source models. The proposed method improves 11.4% and 10.0% over the existing state-of-the-art method in terms of attack success rate.
arXiv Detail & Related papers (2025-05-10T16:04:52Z)
Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints [81.14852921721793]
This study aims to understand and enhance the transferability of gradient-based jailbreaking methods. We introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints. Our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%.
arXiv Detail & Related papers (2025-02-25T07:47:41Z)
Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z)
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves [64.46372846359694]
IDEATOR is a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks [0.0]
Large Vision-Language Models (LVLMs) have significantly advanced AI by excelling in vision-language tasks. Jailbreak attacks bypass safety protocols and cause the model to generate misleading or harmful responses. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder by leveraging a Siamese architecture.
arXiv Detail & Related papers (2024-09-11T15:39:42Z)
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models [20.385314634225978]
We focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We find that transferable gradient-based image jailbreaks are extremely difficult to obtain.
arXiv Detail & Related papers (2024-07-21T16:27:24Z)
White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models. Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input. An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z)
On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting. In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP. Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.